date()
## [1] "Fri Nov 18 20:01:12 2022"
IODS2022 is an introductory course to open data science and the open tools commonly used in it.
Link to my GitHub repository: https://github.com/jmleppal/IODS-project
I went through R for Health Data Science and got some grasp of the elementary use of R. All in all, the book is an excellent introduction to the topic. I am still trying to understand the inner workings of GitHub, but at the moment it looks to me like a long path of trial and error. When it comes to R Markdown, working in the visual mode seems easier now in the beginning.
date()
## [1] "Fri Nov 18 20:01:13 2022"
# READING IN THE ORIGINAL DATASET
# Read the data into memory
learning2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t", header=TRUE)
# Look at the dimensions and structure of the data
dim(learning2014)
## [1] 183 60
# str(learning2014)
The dataset used in this exercise is based on an international survey of students’ approaches to learning in 2013-2015. Measures in parts A, B and C come from the ASSIST (Approaches and Study Skills Inventory for Students) questionnaire, and part D is based on the SATS (Survey of Attitudes Toward Statistics) questionnaire. Some background and other variables have been omitted, but all 183 responses are included, so the data comprise 60 variables and 183 observations. All variables except gender are integers. No values are missing.
# CREATING THE ANALYSIS DATASET
# Access the dplyr library
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Create an analysis dataset
deep_questions <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30","D06", "D15", "D23", "D31")
surface_questions <- c("SU02","SU10","SU18","SU26", "SU05","SU13","SU21","SU29","SU08","SU16","SU24","SU32")
strategic_questions <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20","ST28")
learn2014an <- learning2014 %>% select(gender, Age, Points)
learn2014an$deep <- rowMeans(learning2014[, deep_questions])
learn2014an$surf <- rowMeans(learning2014[, surface_questions])
learn2014an$stra <- rowMeans(learning2014[, strategic_questions])
# create the column 'attitude' by scaling the column "Attitude"
learn2014an$attitude <- learning2014$Attitude / 10
# print out the column names of the data
colnames(learn2014an)
## [1] "gender" "Age" "Points" "deep" "surf" "stra" "attitude"
# change the names of certain columns
colnames(learn2014an)[2] <- "age"
colnames(learn2014an)[3] <- "points"
# select rows where points is greater than zero
learn2014an <- filter(learn2014an, points > 0)
# Look at the dimensions and structure of the analysis data
# dim(learn2014an)
str(learn2014an)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
In the analysis dataset, the variables gender, age and points are unmodified; attitude is obtained by dividing the original Attitude by 10, and the new variables deep, surf and stra are formed by averaging the respective part B items. Attitude measures the global attitude towards statistics. Deep measures how inclined the participant is towards deep learning, surf how superficially the participant approaches studying, and stra how organized the participant’s manner of studying is. Points are the points obtained in an exam of the course Introduction to Social Statistics.
The value 0 in points means that the participant did not take the exam (n = 17, 9%). These participants are excluded from the analysis dataset, which thus consists of 7 variables and 166 observations.
# SAVING THE ANALYSIS DATASET AND CHECKING THAT READING ANEW WORKS
# Set the working directory for saving and data
setwd("C:/Users/yona2/Desktop/FT/IODS2022/RStudio/IODS-project/data")
# Access the tidyverse library
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ stringr 1.4.1
## ✔ tidyr 1.2.1 ✔ forcats 0.5.2
## ✔ readr 2.1.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# Save the analysis dataset
write.csv(learn2014an,"learn2014an.csv", row.names = FALSE)
# Checking that reading of the saved file works
# learn2014an2 <- read.table("learn2014an.csv", sep=",", header=TRUE)
# str(learn2014an2)
# head(learn2014an2)
Saving and reading the saved file work as expected.
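The save-and-reread check commented out above can also be made explicit. A minimal, self-contained sketch of the same round trip is below; the toy data frame stands in for learn2014an (its column names and types are chosen to mimic the analysis data).

```r
# Sketch of the CSV round-trip check, using a toy data frame in place of
# learn2014an. write.csv()/read.csv() should reproduce the data exactly.
df <- data.frame(gender = c("F", "M"), age = c(24L, 27L), points = c(22L, 25L))
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
df2 <- read.csv(path)
all.equal(df, df2)   # TRUE when values and column types survive the round trip
```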
# EXPLORING THE ANALYSIS DATA
# Access the GGally, ggplot2 and cowplot libraries
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# Get summary values
learn2014an %>% group_by(gender) %>% summarise(count= n())
## # A tibble: 2 × 2
## gender count
## <chr> <int>
## 1 F 110
## 2 M 56
summarise_at(learn2014an, vars(attitude:points), tibble::lst(mean, sd))
## attitude_mean stra_mean surf_mean deep_mean points_mean attitude_sd stra_sd
## 1 3.142771 3.121235 2.787149 3.679719 22.71687 0.7299079 0.7718318
## surf_sd deep_sd points_sd
## 1 0.5288405 0.5541369 5.894884
learn2014an %>% group_by(gender) %>% summarise_at(vars(age), tibble::lst(mean, sd, min, median, max))
## # A tibble: 2 × 6
## gender mean sd min median max
## <chr> <dbl> <dbl> <int> <dbl> <int>
## 1 F 24.9 7.36 17 22 53
## 2 M 26.8 8.43 19 24 55
learn2014an %>% group_by(gender) %>% summarise_at(vars(attitude), tibble::lst(mean, sd, min, median, max))
## # A tibble: 2 × 6
## gender mean sd min median max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F 2.99 0.725 1.4 2.95 5
## 2 M 3.44 0.646 1.7 3.4 4.8
learn2014an %>% group_by(gender) %>% summarise_at(vars(deep), tibble::lst(mean, sd, min, median, max))
## # A tibble: 2 × 6
## gender mean sd min median max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F 3.66 0.535 1.58 3.67 4.75
## 2 M 3.72 0.592 2.08 3.79 4.92
learn2014an %>% group_by(gender) %>% summarise_at(vars(surf), tibble::lst(mean, sd, min, median, max))
## # A tibble: 2 × 6
## gender mean sd min median max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F 2.83 0.458 1.83 2.83 4
## 2 M 2.70 0.642 1.58 2.62 4.33
learn2014an %>% group_by(gender) %>% summarise_at(vars(stra), tibble::lst(mean, sd, min, median, max))
## # A tibble: 2 × 6
## gender mean sd min median max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F 3.20 0.748 1.38 3.25 5
## 2 M 2.96 0.801 1.25 3 4.5
learn2014an %>% group_by(gender) %>% summarise_at(vars(points), tibble::lst(mean, sd, min, median, max))
## # A tibble: 2 × 6
## gender mean sd min median max
## <chr> <dbl> <dbl> <int> <dbl> <int>
## 1 F 22.3 5.83 7 23 33
## 2 M 23.5 6.01 9 23.5 33
# Create and draw a more advanced plot matrix with ggpairs()
p <- ggpairs(learn2014an[, c(1, 2, 7, 4, 5, 6, 3)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), progress = FALSE)
p
# Initialize plot with data and aesthetic mapping
p1 <- ggplot(learn2014an, aes(x = attitude, y = points))
# Define the visualization type (points) + add a regression line
# + add a main title and draw the plot
p2 <- p1 + geom_point() + geom_smooth(method = "lm") + ggtitle("Student's attitude versus exam points")
p2
## `geom_smooth()` using formula 'y ~ x'
Out of 166 participants, 56 (34%) are men and 110 (66%) women. Age ranges from 17 to 55 years, the average age (sd) being 26.8 (8.4) years for men and 24.9 (7.4) years for women. The mean scores (sd) are 3.1 (0.7) for attitude, 3.7 (0.6) for deep, 2.8 (0.5) for surf, 3.1 (0.8) for stra and 22.7 (5.9) for points. The distribution of age is skewed to the right; otherwise the distributions are reasonably symmetrical and do not differ notably between the genders.
Attitude is significantly correlated with points (rho = 0.44), an association that can also be seen clearly in the respective plot. Surf is slightly negatively (rho = -0.14) and stra slightly positively (rho = 0.15) correlated with points, as expected. Surf also shows statistically significant negative correlations with attitude (rho = -0.18), deep (rho = -0.32) and stra (rho = -0.16), as expected. Otherwise the associations are unremarkable.
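The pairwise correlations quoted above come from the ggpairs() matrix; a single correlation and its significance can also be computed directly. A sketch on simulated data follows (only the pattern matters here); with the real data the call would simply be cor.test(learn2014an$attitude, learn2014an$points).

```r
# Sketch: compute a pairwise correlation and test its significance.
# Simulated data standing in for the analysis dataset.
set.seed(1)
d <- data.frame(attitude = rnorm(166, mean = 3.1, sd = 0.73))
d$points <- 11.6 + 3.5 * d$attitude + rnorm(166, sd = 5.3)
ct <- cor.test(d$attitude, d$points)   # Pearson correlation by default
round(ct$estimate, 2)                  # the correlation coefficient
ct$p.value                             # its significance
```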
# BUILDING A MULTIPLE REGRESSION MODEL
# Fit a linear model
my_model1 <- lm(points ~ age, data = learn2014an)
my_model2 <- lm(points ~ gender, data = learn2014an)
my_model3 <- lm(points ~ attitude, data = learn2014an)
my_model4 <- lm(points ~ deep, data = learn2014an)
my_model5 <- lm(points ~ surf, data = learn2014an)
my_model6 <- lm(points ~ stra, data = learn2014an)
my_model31 <- lm(points ~ attitude + stra + surf, data = learn2014an)
my_model32 <- lm(points ~ attitude + age + gender, data = learn2014an)
# Print out a summary of the models
summary(my_model1)
##
## Call:
## lm(formula = points ~ age, data = learn2014an)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.0360 -3.7531 0.0958 4.6762 10.8128
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.52150 1.57339 15.585 <2e-16 ***
## age -0.07074 0.05901 -1.199 0.232
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.887 on 164 degrees of freedom
## Multiple R-squared: 0.008684, Adjusted R-squared: 0.00264
## F-statistic: 1.437 on 1 and 164 DF, p-value: 0.2324
summary(my_model2)
##
## Call:
## lm(formula = points ~ gender, data = learn2014an)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.3273 -3.3273 0.5179 4.5179 10.6727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.3273 0.5613 39.776 <2e-16 ***
## genderM 1.1549 0.9664 1.195 0.234
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.887 on 164 degrees of freedom
## Multiple R-squared: 0.008632, Adjusted R-squared: 0.002587
## F-statistic: 1.428 on 1 and 164 DF, p-value: 0.2338
summary(my_model3)
##
## Call:
## lm(formula = points ~ attitude, data = learn2014an)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
summary(my_model4)
##
## Call:
## lm(formula = points ~ deep, data = learn2014an)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6913 -3.6935 0.2862 4.9957 10.3537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.1141 3.0908 7.478 4.31e-12 ***
## deep -0.1080 0.8306 -0.130 0.897
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.913 on 164 degrees of freedom
## Multiple R-squared: 0.000103, Adjusted R-squared: -0.005994
## F-statistic: 0.01689 on 1 and 164 DF, p-value: 0.8967
summary(my_model5)
##
## Call:
## lm(formula = points ~ surf, data = learn2014an)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6539 -3.3744 0.3574 4.4734 10.2234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.2017 2.4432 11.134 <2e-16 ***
## surf -1.6091 0.8613 -1.868 0.0635 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.851 on 164 degrees of freedom
## Multiple R-squared: 0.02084, Adjusted R-squared: 0.01487
## F-statistic: 3.49 on 1 and 164 DF, p-value: 0.06351
summary(my_model6)
##
## Call:
## lm(formula = points ~ stra, data = learn2014an)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.5581 -3.8198 0.1042 4.3024 10.1394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.233 1.897 10.141 <2e-16 ***
## stra 1.116 0.590 1.892 0.0603 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.849 on 164 degrees of freedom
## Multiple R-squared: 0.02135, Adjusted R-squared: 0.01538
## F-statistic: 3.578 on 1 and 164 DF, p-value: 0.06031
summary(my_model31)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learn2014an)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
summary(my_model32)
##
## Call:
## lm(formula = points ~ attitude + age + gender, data = learn2014an)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4590 -3.3221 0.2186 4.0247 10.4632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.42910 2.29043 5.863 2.48e-08 ***
## attitude 3.60657 0.59322 6.080 8.34e-09 ***
## age -0.07586 0.05367 -1.414 0.159
## genderM -0.33054 0.91934 -0.360 0.720
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.187
## F-statistic: 13.65 on 3 and 162 DF, p-value: 5.536e-08
summary(my_model3)
##
## Call:
## lm(formula = points ~ attitude, data = learn2014an)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
# Draw diagnostic plots using the plot() function, choosing the plots 1, 2 and 5
par(mfrow = c(1,3))
plot(my_model3, which = c(1, 2, 5))
In univariate regression models, only attitude has a statistically significant association with points (p = 4.12e-09). Surf and stra are marginally associated with points (p = 0.06 for both).
Attitude, surf and stra are selected as explanatory variables for a multiple regression model. In the model, attitude preserves its statistical significance while surf and stra add little. The overall fit of the model does not improve: there is practically no change in multiple R-squared (0.21 in the multiple regression versus 0.19 in the univariate model) and the overall p-value actually worsens (3.16e-08 versus 4.12e-09). Likewise, adding the demographic factors age and gender to the model does not improve it.
Therefore, only attitude is selected as the explanatory variable for the final model. An increase of one unit in attitude is associated with an increase of 3.5 points in the exam result. Based on the multiple R-squared statistic, attitude explains 19% of the variation in points. Even though this “explanatory power” is rather small, the model is statistically highly significant (F = 38.6 with 1 and 164 df, p = 4.12e-09). This kind of result is rather typical of human behavioral data.
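A confidence interval for the slope makes the 3.5-points-per-unit estimate easier to judge. A sketch on simulated data mimicking the model above follows; with the real data it would simply be confint(my_model3).

```r
# Sketch: 95 % confidence interval for the slope of a univariate model.
# Simulated data standing in for learn2014an.
set.seed(2)
d <- data.frame(attitude = rnorm(166, mean = 3.1, sd = 0.73))
d$points <- 11.6 + 3.5 * d$attitude + rnorm(166, sd = 5.3)
m <- lm(points ~ attitude, data = d)
round(confint(m, "attitude"), 2)   # lower and upper bounds for the slope
```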
In a linear regression model, the association between the dependent and explanatory variables is assumed to be linear. The residuals should be normally distributed with constant variance, and the explanatory variables should not be strongly correlated with each other. The diagnostic plots did not show any noteworthy deviations; there were a couple of outliers, but they did not seem to distort the association between attitude and points. The distributions of both attitude and points were reasonably symmetrical and close enough to normal. In the scatter plot of points against attitude, the association appeared clearly monotonically increasing.
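The Q-Q plot’s message can be complemented with a numeric normality test of the residuals. A sketch on a toy model follows; with the real data the call would be shapiro.test(residuals(my_model3)).

```r
# Sketch: Shapiro-Wilk test of residual normality for a fitted linear
# model, using simulated data.
set.seed(3)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)
shapiro.test(residuals(fit))   # a large p-value: no evidence against normality
```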
date()
## [1] "Fri Nov 18 20:01:27 2022"
# READING IN THE DATASET
# set the working directory
setwd("C:/Users/yona2/Desktop/FT/IODS2022/RStudio/IODS-project/data")
# access the tidyverse library
library(tidyverse)
# read the data
alc <- read.table("alc.csv", sep = ",", header = TRUE)
# check structure and dimensions
colnames(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "guardian" "traveltime" "studytime"
## [16] "schoolsup" "famsup" "activities" "nursery" "higher"
## [21] "internet" "romantic" "famrel" "freetime" "goout"
## [26] "Dalc" "Walc" "health" "failures" "paid"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use"
# glimpse(alc)
The data are based on two surveys concerning student achievement in secondary education in two Portuguese schools. The dataset contains factors related to performance in mathematics and Portuguese language. The variables failures, paid, absences, G1, G2 and G3 are combined assessments over both subjects. The analyses focus on the relationship between alcohol consumption (alc_use; 1 = very low to 5 = very high) and the other variables. Alcohol use is considered high if alc_use exceeds 2 (high_use).
All in all, there are 36 variables and 370 observations included. No values are missing.
As for primary hypotheses, I would expect going out, failures and absences to be positively and final grade (G3) to be negatively associated with alcohol use.
# EXPLORING THE DATA
# access the GGally and ggplot2 libraries
library(GGally); library(ggplot2)
# set summary values
summary(alc[, c(4, 8:9, 14:15, 23:25, 28:29, 31, 34:35)])
## age Medu Fedu traveltime studytime
## Min. :15.00 Min. :0.0 Min. :0.000 Min. :1.000 Min. :1.000
## 1st Qu.:16.00 1st Qu.:2.0 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :17.00 Median :3.0 Median :3.000 Median :1.000 Median :2.000
## Mean :16.58 Mean :2.8 Mean :2.557 Mean :1.446 Mean :2.043
## 3rd Qu.:17.00 3rd Qu.:4.0 3rd Qu.:3.750 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :22.00 Max. :4.0 Max. :4.000 Max. :4.000 Max. :4.000
## famrel freetime goout health
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:3.000
## Median :4.000 Median :3.000 Median :3.000 Median :4.000
## Mean :3.935 Mean :3.224 Mean :3.116 Mean :3.562
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## failures absences G3 alc_use
## Min. :0.0000 Min. : 0.000 Min. : 0.00 Min. :1.000
## 1st Qu.:0.0000 1st Qu.: 1.000 1st Qu.:10.00 1st Qu.:1.000
## Median :0.0000 Median : 3.000 Median :12.00 Median :1.500
## Mean :0.1892 Mean : 4.511 Mean :11.52 Mean :1.889
## 3rd Qu.:0.0000 3rd Qu.: 6.000 3rd Qu.:14.00 3rd Qu.:2.500
## Max. :3.0000 Max. :45.000 Max. :18.00 Max. :5.000
alc %>% group_by(sex) %>% summarise(count= n())
## # A tibble: 2 × 2
## sex count
## <chr> <int>
## 1 F 195
## 2 M 175
alc %>% group_by(high_use) %>% summarise(count= n())
## # A tibble: 2 × 2
## high_use count
## <lgl> <int>
## 1 FALSE 259
## 2 TRUE 111
alc %>% group_by(sex, high_use) %>% summarise(count= n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 3
## # Groups: sex [2]
## sex high_use count
## <chr> <lgl> <int>
## 1 F FALSE 154
## 2 F TRUE 41
## 3 M FALSE 105
## 4 M TRUE 70
p1 <- ggpairs(alc[, c(35, 4, 8:9, 14:15, 23:25)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), progress = FALSE)
p2 <- ggpairs(alc[, c(35, 28:29, 31, 34)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), progress = FALSE)
p3 <- ggpairs(alc[, c(36, 4, 8:9, 14:15, 23:25)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), progress = FALSE)
p4 <- ggpairs(alc[, c(36, 28:29, 31, 34:35)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), progress = FALSE)
p1; p2; p3; p4
# initialize plot with data and aesthetic mapping
p5 <- ggplot(alc, aes(x = goout, y = alc_use))
p6 <- ggplot(alc, aes(x = failures, y = alc_use))
p7 <- ggplot(alc, aes(x = absences, y = alc_use))
p8 <- ggplot(alc, aes(x = G3, y = alc_use))
# define the visualization type (points) + add a regression line
# + add a main title and draw the plot
p9 <- p5 + geom_point() + geom_smooth(method = "lm") + ggtitle("Student's going out versus alcohol use")
p10 <- p6 + geom_point() + geom_smooth(method = "lm") + ggtitle("Student's failures versus alcohol use")
p11 <- p7 + geom_point() + geom_smooth(method = "lm") + ggtitle("Student's absences versus alcohol use")
p12 <- p8 + geom_point() + geom_smooth(method = "lm") + ggtitle("Student's G3 versus alcohol use")
p9; p10; p11; p12
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
p13 <- ggpairs(alc[, c(3:4, 25, 29, 31, 34)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), progress = FALSE)
p13;
Out of 370 participants, 175 (47%) are men and 195 (53%) women. The median age is 17 years, ranging from 15 to 22 years. In all, 111 participants (30%) use alcohol in high quantities; 70 men (40%) and 41 women (21%) are high users. The distributions (median, range) of going out (3, 1-5) and final grade (12, 0-18) are reasonably symmetrical, but failures (0, 0-3) and absences (3, 0-45) are highly skewed to the right.
As expected, in the initial exploration going out (rho = 0.40), failures (rho = 0.21) and absences (rho = 0.21) are positively, and the final grade (rho = -0.16) negatively, associated with alcohol use. When alcohol use is dichotomized, these relationships appear attenuated. Going out (rho = -0.15), failures (rho = -0.36) and absences (rho = -0.10) are negatively associated with the final grade. Furthermore, going out is positively associated with failures (rho = 0.14) and absences (rho = 0.11).
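The attenuation seen when alcohol use is dichotomized is expected: cutting a graded variable at a threshold discards information. A simulated sketch of the effect (the variable names are illustrative, not the real data):

```r
# Sketch: dichotomizing a graded variable typically weakens its observed
# correlation with another variable. Simulated example.
set.seed(4)
use   <- sample(1:5, 370, replace = TRUE)        # graded consumption, 1-5
goout <- use + rnorm(370, sd = 2)                # a correlated covariate
c(graded = cor(goout, use), dichotomized = cor(goout, use > 2))
```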
# BUILDING A LOGISTIC REGRESSION MODEL
# find the model with glm()
m1 <- glm(high_use ~ age, data = alc, family = "binomial")
m2 <- glm(high_use ~ sex, data = alc, family = "binomial")
m3 <- glm(high_use ~ goout, data = alc, family = "binomial")
m4 <- glm(high_use ~ failures, data = alc, family = "binomial")
m5 <- glm(high_use ~ absences, data = alc, family = "binomial")
m6 <- glm(high_use ~ G3, data = alc, family = "binomial")
m7 <- glm(high_use ~ age + sex, data = alc, family = "binomial")
m8 <- glm(high_use ~ goout + failures + absences + G3, data = alc, family = "binomial")
m9 <- glm(high_use ~ age + sex + goout + failures + absences, data = alc, family = "binomial")
m10 <- glm(high_use ~ sex + goout + failures + absences, data = alc, family = "binomial")
# print out a summary of the model
summary(m1); summary(m2); summary(m3); summary(m4); summary(m5);
##
## Call:
## glm(formula = high_use ~ age, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0968 -0.8705 -0.8020 1.4319 1.6943
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.07538 1.60980 -2.532 0.0114 *
## age 0.19414 0.09628 2.016 0.0438 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 447.94 on 368 degrees of freedom
## AIC: 451.94
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = high_use ~ sex, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0108 -1.0108 -0.6871 1.3537 1.7660
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.3234 0.1757 -7.530 5.06e-14 ***
## sexM 0.9179 0.2339 3.925 8.67e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 436.13 on 368 degrees of freedom
## AIC: 440.13
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = high_use ~ goout, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3712 -0.7687 -0.5470 0.9952 2.3037
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.3368 0.4188 -7.968 1.62e-15 ***
## goout 0.7563 0.1165 6.494 8.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 403.05 on 368 degrees of freedom
## AIC: 407.05
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = high_use ~ failures, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6367 -0.7943 -0.7943 1.3143 1.6171
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9920 0.1234 -8.040 8.99e-16 ***
## failures 0.6759 0.1973 3.425 0.000615 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 439.75 on 368 degrees of freedom
## AIC: 443.75
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = high_use ~ absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3598 -0.8209 -0.7318 1.2658 1.7419
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.26945 0.16094 -7.888 3.08e-15 ***
## absences 0.08867 0.02317 3.827 0.00013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 434.52 on 368 degrees of freedom
## AIC: 438.52
##
## Number of Fisher Scoring iterations: 4
summary(m6); summary(m7); summary(m8); summary(m9); summary(m10);
##
## Call:
## glm(formula = high_use ~ G3, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2365 -0.8551 -0.7651 1.4213 1.7731
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.13783 0.40374 0.341 0.7328
## G3 -0.08689 0.03466 -2.507 0.0122 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 445.68 on 368 degrees of freedom
## AIC: 449.68
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = high_use ~ age + sex, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3034 -0.8870 -0.7086 1.2275 1.9163
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.72614 1.65095 -2.863 0.0042 **
## age 0.20425 0.09812 2.082 0.0374 *
## sexM 0.93259 0.23552 3.960 7.51e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 431.73 on 367 degrees of freedom
## AIC: 437.73
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = high_use ~ goout + failures + absences + G3, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0077 -0.7367 -0.5427 0.9191 2.3664
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.42137 0.69467 -4.925 8.43e-07 ***
## goout 0.69791 0.11881 5.874 4.25e-09 ***
## failures 0.52078 0.23720 2.195 0.028128 *
## absences 0.07373 0.02233 3.303 0.000958 ***
## G3 -0.01612 0.04171 -0.386 0.699253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 383.91 on 365 degrees of freedom
## AIC: 393.91
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = high_use ~ age + sex + goout + failures + absences,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1461 -0.7853 -0.5274 0.7258 2.3609
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.72724 1.89765 -2.491 0.012735 *
## age 0.03623 0.11375 0.318 0.750115
## sexM 0.98157 0.26166 3.751 0.000176 ***
## goout 0.69397 0.12157 5.708 1.14e-08 ***
## failures 0.47932 0.23331 2.054 0.039930 *
## absences 0.08171 0.02276 3.590 0.000331 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 369.40 on 364 degrees of freedom
## AIC: 381.4
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = high_use ~ sex + goout + failures + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1521 -0.7929 -0.5317 0.7412 2.3706
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.14389 0.47881 -8.654 < 2e-16 ***
## sexM 0.97989 0.26156 3.746 0.000179 ***
## goout 0.69801 0.12093 5.772 7.83e-09 ***
## failures 0.48932 0.23073 2.121 0.033938 *
## absences 0.08246 0.02266 3.639 0.000274 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 369.50 on 365 degrees of freedom
## AIC: 379.5
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m1); coef(m2); coef(m3); coef(m4); coef(m5);
## (Intercept) age
## -4.0753783 0.1941355
## (Intercept) sexM
## -1.3233805 0.9179154
## (Intercept) goout
## -3.3368302 0.7563474
## (Intercept) failures
## -0.9920231 0.6758788
## (Intercept) absences
## -1.26944532 0.08866504
coef(m6); coef(m7); coef(m8); coef(m9); coef(m10);
## (Intercept) G3
## 0.1378294 -0.0868889
## (Intercept) age sexM
## -4.7261439 0.2042451 0.9325859
## (Intercept) goout failures absences G3
## -3.42137467 0.69790726 0.52077970 0.07373480 -0.01611567
## (Intercept) age sexM goout failures absences
## -4.72723868 0.03622756 0.98157242 0.69396783 0.47932468 0.08171370
## (Intercept) sexM goout failures absences
## -4.14389330 0.97988506 0.69801208 0.48932427 0.08245946
# comparing models
m82 <- glm(high_use ~ goout + failures + absences, data = alc, family = "binomial")
anova(m8, m82, test="LRT")
## Analysis of Deviance Table
##
## Model 1: high_use ~ goout + failures + absences + G3
## Model 2: high_use ~ goout + failures + absences
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 365 383.91
## 2 366 384.06 -1 -0.14891 0.6996
anova(m9, m10, test="LRT")
## Analysis of Deviance Table
##
## Model 1: high_use ~ age + sex + goout + failures + absences
## Model 2: high_use ~ sex + goout + failures + absences
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 364 369.4
## 2 365 369.5 -1 -0.10135 0.7502
# compute odds ratios (OR) and confidence intervals (CI) for the final model
OR10 <- coef(m10) %>% exp %>% round(digits = 2)
CI10 <- confint(m10) %>% exp %>% round(digits = 2)
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR10, CI10)
## OR10 2.5 % 97.5 %
## (Intercept) 0.02 0.01 0.04
## sexM 2.66 1.61 4.49
## goout 2.01 1.59 2.56
## failures 1.63 1.04 2.59
## absences 1.09 1.04 1.14
In univariate logistic regression models, all selected explanatory variables, as well as age and sex as demographic factors, are statistically significantly associated with high alcohol use. In the multiple regression model, however, final grade (G3) and age are no longer significantly associated with high alcohol consumption. Non-significant likelihood ratio tests also favor leaving them out of the final model. The point estimates of the explanatory variables retained in the final model are nearly the same as in the respective univariate models.
In the final model, a one-unit increase in going out increased the odds of high alcohol use by 101% (95% CI: 59%-156%), a one-unit increase in failures by 63% (4%-159%), and each additional absence by 9% (4%-14%). Male sex increased the odds by 166% (95% CI: 61%-349%).
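These percentages follow directly from exponentiating the model coefficients: for a logistic regression coefficient b, the odds ratio is exp(b) and the percent change in odds is 100*(exp(b) - 1). A minimal sketch using the goout coefficient of m10 printed above:

```r
# percent change in the odds of high_use per one-unit increase in a predictor:
# odds ratio OR = exp(b); percent change in odds = 100 * (exp(b) - 1)
b_goout <- 0.69801208              # coefficient of 'goout' in the final model m10
or_goout <- exp(b_goout)           # odds ratio
pct_goout <- 100 * (or_goout - 1)  # percent increase in odds
round(or_goout, 2)                 # 2.01
round(pct_goout)                   # 101
```

The same transformation applied to the confidence limits of b gives the 95% CI reported for the odds ratio.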
# PREDICTIVE ABILITY OF THE FINAL MODEL
# access dplyr and ggplot2
library(dplyr); library(ggplot2)
# fit the model
m10 <- glm(high_use ~ sex + goout + failures + absences, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m10, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% addmargins %>% round(digits = 2)
## prediction
## high_use FALSE TRUE Sum
## FALSE 243 16 259
## TRUE 61 50 111
## Sum 304 66 370
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins %>% round(digits = 2)
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66 0.04 0.70
## TRUE 0.16 0.14 0.30
## Sum 0.82 0.18 1.00
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = high_use, y = probability, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# call loss_func to compute the proportion of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = 0) %>% round(digits = 2)
## [1] 0.3
loss_func(class = alc$high_use, prob = 1) %>% round(digits = 2)
## [1] 0.7
loss_func(class = alc$high_use, prob = alc$probability) %>% round(digits = 2)
## [1] 0.21
The overall accuracy of the model is high: 293 (79%) participants are correctly classified. There is not much difference between the two prediction levels: of those predicted to be low users of alcohol, 243 (80%) truly are low users, whereas of those predicted to be high users, 50 (76%) truly are high users.
The model does improve predictive performance compared with random guessing: 21% are misclassified by the model, whereas 30% would be misclassified by guessing a probability of 0 for everyone and 70% by guessing a probability of 1 for everyone.
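These figures can be recomputed directly from the 2x2 cross-tabulation printed above. A minimal sketch using those counts:

```r
# confusion matrix counts from the cross-tabulation above
# rows: true high_use (FALSE, TRUE); columns: prediction (FALSE, TRUE)
cm <- matrix(c(243, 61, 16, 50), nrow = 2,
             dimnames = list(high_use = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))
accuracy <- sum(diag(cm)) / sum(cm)               # proportion correctly classified
npv <- cm["FALSE", "FALSE"] / sum(cm[, "FALSE"])  # correct among predicted low users
ppv <- cm["TRUE", "TRUE"] / sum(cm[, "TRUE"])     # correct among predicted high users
round(c(accuracy = accuracy, npv = npv, ppv = ppv), 2)
# accuracy  npv  ppv
#     0.79 0.80 0.76
```

The training error reported by loss_func (0.21) is simply 1 minus this accuracy.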
# BONUS SECTION: ON CROSS-VALIDATION
# K-fold cross-validation (note: K = nrow(alc)/10 = 37 folds here, not 10-fold)
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m10, K = nrow(alc) / 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2108108
The final model does have slightly better test-set performance: its cross-validation error is 0.21, compared with the error of 0.26 produced by the model in the Exercise Set.
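Because cv.glm() assigns observations to folds at random, the cross-validation error varies somewhat from run to run; repeating the procedure gives a feel for that variability. A minimal sketch on hypothetical simulated data standing in for alc (the loss function is the same as defined above):

```r
library(boot)
set.seed(123)
# hypothetical simulated stand-in data, same size as alc (370 rows)
n <- 370
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x1 + 0.5 * x2))
d <- data.frame(y, x1, x2)
fit <- glm(y ~ x1 + x2, data = d, family = "binomial")
# same loss function as above: proportion of wrong predictions
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
# repeat 10-fold CV a few times to see how much the error estimate varies
errs <- replicate(5, cv.glm(data = d, cost = loss_func, glmfit = fit, K = 10)$delta[1])
round(errs, 3)
```

Reporting the mean (or range) of the repeated estimates is more stable than quoting a single CV run.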
# SUPER-BONUS SECTION: ON MODEL SELECTION AND CROSS-VALIDATION
# I'm sure there is a much more elegant way of writing this R code...
# different models and their summaries
m21 <- glm(high_use ~ age + sex + health + address + famsize + Pstatus + Fedu + Mjob + guardian + school + reason + schoolsup + famsup + paid + activities + nursery + higher + internet + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m21)
##
## Call:
## glm(formula = high_use ~ age + sex + health + address + famsize +
## Pstatus + Fedu + Mjob + guardian + school + reason + schoolsup +
## famsup + paid + activities + nursery + higher + internet +
## romantic + famrel + studytime + traveltime + freetime + goout +
## failures + absences + G3, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7733 -0.6942 -0.3703 0.5275 2.8698
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.06934 2.95252 -1.040 0.298541
## age 0.05401 0.14774 0.366 0.714696
## sexM 0.86707 0.33018 2.626 0.008639 **
## health 0.19978 0.11237 1.778 0.075433 .
## addressU -0.85388 0.41398 -2.063 0.039149 *
## famsizeLE3 0.50174 0.32064 1.565 0.117621
## PstatusT -0.21412 0.51220 -0.418 0.675910
## Fedu 0.04477 0.15749 0.284 0.776218
## Mjobhealth -0.86921 0.66777 -1.302 0.193033
## Mjobother -0.76813 0.46827 -1.640 0.100929
## Mjobservices -0.65342 0.51502 -1.269 0.204540
## Mjobteacher -0.36452 0.57503 -0.634 0.526139
## guardianmother -0.81573 0.36120 -2.258 0.023921 *
## guardianother -0.02361 0.77809 -0.030 0.975790
## schoolMS -0.41859 0.55098 -0.760 0.447418
## reasonhome 0.43518 0.38491 1.131 0.258223
## reasonother 1.33386 0.52127 2.559 0.010501 *
## reasonreputation -0.05167 0.40177 -0.129 0.897677
## schoolsupyes 0.07500 0.45298 0.166 0.868493
## famsupyes -0.15239 0.32301 -0.472 0.637080
## paidyes 0.64657 0.31745 2.037 0.041673 *
## activitiesyes -0.54344 0.30862 -1.761 0.078264 .
## nurseryyes -0.46789 0.36303 -1.289 0.197453
## higheryes -0.17828 0.71758 -0.248 0.803787
## internetyes 0.21525 0.44485 0.484 0.628467
## romanticyes -0.53828 0.32531 -1.655 0.097993 .
## famrel -0.52687 0.16561 -3.181 0.001465 **
## studytime -0.28025 0.20271 -1.382 0.166824
## traveltime 0.25226 0.22899 1.102 0.270621
## freetime 0.23073 0.16392 1.408 0.159248
## goout 0.87900 0.15327 5.735 9.75e-09 ***
## failures 0.29553 0.27415 1.078 0.281042
## absences 0.09199 0.02614 3.519 0.000433 ***
## G3 0.02725 0.05191 0.525 0.599637
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 318.90 on 336 degrees of freedom
## AIC: 386.9
##
## Number of Fisher Scoring iterations: 5
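One more elegant alternative to dropping predictors by hand, as done below, is stepwise backward elimination with step(), which removes terms one at a time as long as the AIC improves; on the real data this would be step(m21, direction = "backward"). A minimal sketch on hypothetical simulated data:

```r
set.seed(42)
# hypothetical simulated data: x3 is pure noise and is a candidate for removal
n <- 370
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.9 * x1 + 0.6 * x2))
d <- data.frame(y, x1, x2, x3)
full <- glm(y ~ x1 + x2 + x3, data = d, family = "binomial")
# backward elimination by AIC; trace = 0 suppresses the per-step printout
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)
```

Automated selection by AIC will not always agree with elimination by p-values, so the result is a starting point rather than a substitute for judgment.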
m22 <- glm(high_use ~ age + sex + health + address + famsize + Pstatus + Fedu + Mjob + guardian + school + reason + famsup + paid + activities + nursery + higher + internet + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m22)
##
## Call:
## glm(formula = high_use ~ age + sex + health + address + famsize +
## Pstatus + Fedu + Mjob + guardian + school + reason + famsup +
## paid + activities + nursery + higher + internet + romantic +
## famrel + studytime + traveltime + freetime + goout + failures +
## absences + G3, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7609 -0.6940 -0.3720 0.5267 2.8647
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.96925 2.89116 -1.027 0.304415
## age 0.04918 0.14482 0.340 0.734150
## sexM 0.85938 0.32682 2.630 0.008551 **
## health 0.19904 0.11228 1.773 0.076282 .
## addressU -0.84988 0.41350 -2.055 0.039849 *
## famsizeLE3 0.49997 0.32044 1.560 0.118693
## PstatusT -0.21693 0.51219 -0.424 0.671902
## Fedu 0.04554 0.15741 0.289 0.772346
## Mjobhealth -0.87549 0.66660 -1.313 0.189056
## Mjobother -0.77016 0.46819 -1.645 0.099975 .
## Mjobservices -0.65192 0.51482 -1.266 0.205401
## Mjobteacher -0.36879 0.57448 -0.642 0.520900
## guardianmother -0.81692 0.36120 -2.262 0.023718 *
## guardianother -0.02329 0.77775 -0.030 0.976110
## schoolMS -0.42351 0.55004 -0.770 0.441325
## reasonhome 0.43646 0.38496 1.134 0.256889
## reasonother 1.33457 0.52098 2.562 0.010417 *
## reasonreputation -0.05037 0.40179 -0.125 0.900245
## famsupyes -0.15014 0.32279 -0.465 0.641833
## paidyes 0.64340 0.31691 2.030 0.042334 *
## activitiesyes -0.53930 0.30764 -1.753 0.079599 .
## nurseryyes -0.46621 0.36269 -1.285 0.198650
## higheryes -0.17380 0.71753 -0.242 0.808610
## internetyes 0.21588 0.44490 0.485 0.627516
## romanticyes -0.53981 0.32514 -1.660 0.096873 .
## famrel -0.52685 0.16560 -3.181 0.001465 **
## studytime -0.27968 0.20274 -1.380 0.167738
## traveltime 0.25567 0.22830 1.120 0.262772
## freetime 0.23063 0.16394 1.407 0.159502
## goout 0.87846 0.15325 5.732 9.9e-09 ***
## failures 0.29549 0.27438 1.077 0.281504
## absences 0.09192 0.02612 3.519 0.000433 ***
## G3 0.02598 0.05130 0.506 0.612585
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 318.93 on 337 degrees of freedom
## AIC: 384.93
##
## Number of Fisher Scoring iterations: 5
m23 <- glm(high_use ~ age + sex + health + address + famsize + Pstatus + Fedu + Mjob + guardian + school + reason + famsup + paid + activities + nursery + internet + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m23)
##
## Call:
## glm(formula = high_use ~ age + sex + health + address + famsize +
## Pstatus + Fedu + Mjob + guardian + school + reason + famsup +
## paid + activities + nursery + internet + romantic + famrel +
## studytime + traveltime + freetime + goout + failures + absences +
## G3, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7885 -0.6919 -0.3695 0.5206 2.8581
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.19543 2.73368 -1.169 0.242440
## age 0.05587 0.14190 0.394 0.693793
## sexM 0.86745 0.32512 2.668 0.007629 **
## health 0.19870 0.11227 1.770 0.076750 .
## addressU -0.85349 0.41317 -2.066 0.038858 *
## famsizeLE3 0.49873 0.32026 1.557 0.119410
## PstatusT -0.21809 0.51176 -0.426 0.669994
## Fedu 0.04169 0.15666 0.266 0.790163
## Mjobhealth -0.88016 0.66551 -1.323 0.185988
## Mjobother -0.77519 0.46717 -1.659 0.097050 .
## Mjobservices -0.65521 0.51392 -1.275 0.202341
## Mjobteacher -0.37101 0.57428 -0.646 0.518248
## guardianmother -0.81986 0.36078 -2.272 0.023060 *
## guardianother -0.03054 0.77805 -0.039 0.968686
## schoolMS -0.43916 0.54672 -0.803 0.421826
## reasonhome 0.43014 0.38412 1.120 0.262790
## reasonother 1.33275 0.52008 2.563 0.010389 *
## reasonreputation -0.05181 0.40164 -0.129 0.897366
## famsupyes -0.15501 0.32206 -0.481 0.630290
## paidyes 0.63438 0.31470 2.016 0.043817 *
## activitiesyes -0.54909 0.30503 -1.800 0.071840 .
## nurseryyes -0.46197 0.36200 -1.276 0.201902
## internetyes 0.22716 0.44252 0.513 0.607727
## romanticyes -0.52965 0.32216 -1.644 0.100163
## famrel -0.52666 0.16560 -3.180 0.001471 **
## studytime -0.28176 0.20239 -1.392 0.163879
## traveltime 0.25472 0.22814 1.117 0.264205
## freetime 0.22989 0.16397 1.402 0.160909
## goout 0.87855 0.15327 5.732 9.93e-09 ***
## failures 0.30386 0.27304 1.113 0.265750
## absences 0.09209 0.02602 3.540 0.000401 ***
## G3 0.02346 0.05022 0.467 0.640372
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 318.98 on 338 degrees of freedom
## AIC: 382.98
##
## Number of Fisher Scoring iterations: 5
m24 <- glm(high_use ~ age + sex + health + address + famsize + Pstatus + Mjob + guardian + school + reason + famsup + paid + activities + nursery + internet + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m24)
##
## Call:
## glm(formula = high_use ~ age + sex + health + address + famsize +
## Pstatus + Mjob + guardian + school + reason + famsup + paid +
## activities + nursery + internet + romantic + famrel + studytime +
## traveltime + freetime + goout + failures + absences + G3,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8024 -0.6867 -0.3726 0.5270 2.8586
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.05049 2.67834 -1.139 0.254725
## age 0.05400 0.14173 0.381 0.703172
## sexM 0.86914 0.32518 2.673 0.007522 **
## health 0.19961 0.11211 1.781 0.074992 .
## addressU -0.85786 0.41243 -2.080 0.037526 *
## famsizeLE3 0.48881 0.31803 1.537 0.124296
## PstatusT -0.23584 0.50787 -0.464 0.642377
## Mjobhealth -0.86508 0.66285 -1.305 0.191865
## Mjobother -0.78793 0.46436 -1.697 0.089737 .
## Mjobservices -0.65306 0.51381 -1.271 0.203720
## Mjobteacher -0.33772 0.56006 -0.603 0.546508
## guardianmother -0.84113 0.35206 -2.389 0.016887 *
## guardianother -0.03532 0.77788 -0.045 0.963782
## schoolMS -0.43365 0.54688 -0.793 0.427804
## reasonhome 0.43122 0.38402 1.123 0.261480
## reasonother 1.33397 0.52063 2.562 0.010400 *
## reasonreputation -0.04724 0.40116 -0.118 0.906266
## famsupyes -0.14646 0.32032 -0.457 0.647509
## paidyes 0.63330 0.31460 2.013 0.044111 *
## activitiesyes -0.54591 0.30486 -1.791 0.073341 .
## nurseryyes -0.44785 0.35820 -1.250 0.211199
## internetyes 0.23102 0.44189 0.523 0.601110
## romanticyes -0.52520 0.32146 -1.634 0.102303
## famrel -0.52726 0.16571 -3.182 0.001463 **
## studytime -0.28577 0.20165 -1.417 0.156434
## traveltime 0.24871 0.22696 1.096 0.273141
## freetime 0.22620 0.16335 1.385 0.166121
## goout 0.88397 0.15201 5.815 6.06e-09 ***
## failures 0.28968 0.26763 1.082 0.279066
## absences 0.09278 0.02592 3.579 0.000345 ***
## G3 0.02455 0.05001 0.491 0.623489
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 319.05 on 339 degrees of freedom
## AIC: 381.05
##
## Number of Fisher Scoring iterations: 5
m25 <- glm(high_use ~ sex + health + address + famsize + Pstatus + Mjob + guardian + school + reason + famsup + paid + activities + nursery + internet + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m25)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Pstatus +
## Mjob + guardian + school + reason + famsup + paid + activities +
## nursery + internet + romantic + famrel + studytime + traveltime +
## freetime + goout + failures + absences + G3, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8115 -0.6780 -0.3741 0.5351 2.8662
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.18345 1.40586 -1.553 0.120398
## sexM 0.87542 0.32423 2.700 0.006935 **
## health 0.19812 0.11194 1.770 0.076744 .
## addressU -0.87032 0.41091 -2.118 0.034173 *
## famsizeLE3 0.48549 0.31783 1.528 0.126636
## PstatusT -0.23136 0.50751 -0.456 0.648477
## Mjobhealth -0.86599 0.66254 -1.307 0.191187
## Mjobother -0.78347 0.46470 -1.686 0.091799 .
## Mjobservices -0.65478 0.51433 -1.273 0.202993
## Mjobteacher -0.33410 0.55988 -0.597 0.550689
## guardianmother -0.83727 0.35200 -2.379 0.017378 *
## guardianother 0.03273 0.75903 0.043 0.965609
## schoolMS -0.36884 0.52019 -0.709 0.478288
## reasonhome 0.42938 0.38416 1.118 0.263691
## reasonother 1.31806 0.51857 2.542 0.011030 *
## reasonreputation -0.04819 0.40091 -0.120 0.904328
## famsupyes -0.15410 0.31946 -0.482 0.629540
## paidyes 0.63749 0.31432 2.028 0.042544 *
## activitiesyes -0.54843 0.30457 -1.801 0.071751 .
## nurseryyes -0.43975 0.35738 -1.231 0.218509
## internetyes 0.22314 0.44134 0.506 0.613142
## romanticyes -0.50541 0.31711 -1.594 0.110986
## famrel -0.52404 0.16539 -3.169 0.001532 **
## studytime -0.28015 0.20040 -1.398 0.162138
## traveltime 0.24680 0.22724 1.086 0.277444
## freetime 0.22240 0.16299 1.365 0.172397
## goout 0.88971 0.15127 5.881 4.07e-09 ***
## failures 0.30230 0.26479 1.142 0.253582
## absences 0.09363 0.02576 3.635 0.000278 ***
## G3 0.02386 0.05003 0.477 0.633404
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 319.20 on 340 degrees of freedom
## AIC: 379.2
##
## Number of Fisher Scoring iterations: 5
m26 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + school + reason + famsup + paid + activities + nursery + internet + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m26)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob +
## guardian + school + reason + famsup + paid + activities +
## nursery + internet + romantic + famrel + studytime + traveltime +
## freetime + goout + failures + absences + G3, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8257 -0.6871 -0.3759 0.5287 2.8577
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.39445 1.32839 -1.803 0.071465 .
## sexM 0.87214 0.32419 2.690 0.007141 **
## health 0.19659 0.11203 1.755 0.079284 .
## addressU -0.87215 0.41028 -2.126 0.033526 *
## famsizeLE3 0.49404 0.31668 1.560 0.118747
## Mjobhealth -0.85441 0.66232 -1.290 0.197039
## Mjobother -0.75819 0.46117 -1.644 0.100168
## Mjobservices -0.63041 0.51137 -1.233 0.217651
## Mjobteacher -0.31297 0.55700 -0.562 0.574195
## guardianmother -0.82015 0.34963 -2.346 0.018987 *
## guardianother 0.07088 0.76004 0.093 0.925699
## schoolMS -0.37703 0.51952 -0.726 0.468011
## reasonhome 0.42329 0.38397 1.102 0.270289
## reasonother 1.31565 0.51775 2.541 0.011051 *
## reasonreputation -0.04565 0.40123 -0.114 0.909410
## famsupyes -0.15297 0.31920 -0.479 0.631787
## paidyes 0.62587 0.31306 1.999 0.045585 *
## activitiesyes -0.55798 0.30363 -1.838 0.066104 .
## nurseryyes -0.42608 0.35608 -1.197 0.231472
## internetyes 0.20904 0.44029 0.475 0.634943
## romanticyes -0.50246 0.31704 -1.585 0.113007
## famrel -0.52594 0.16579 -3.172 0.001512 **
## studytime -0.28026 0.20035 -1.399 0.161869
## traveltime 0.24293 0.22631 1.073 0.283073
## freetime 0.21494 0.16181 1.328 0.184067
## goout 0.88814 0.15097 5.883 4.04e-09 ***
## failures 0.30053 0.26510 1.134 0.256940
## absences 0.09458 0.02555 3.702 0.000214 ***
## G3 0.02611 0.04981 0.524 0.600062
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 319.40 on 341 degrees of freedom
## AIC: 377.4
##
## Number of Fisher Scoring iterations: 5
m27 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + school + reason + famsup + paid + activities + nursery + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m27)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob +
## guardian + school + reason + famsup + paid + activities +
## nursery + romantic + famrel + studytime + traveltime + freetime +
## goout + failures + absences + G3, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8328 -0.6796 -0.3747 0.5404 2.8683
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.32240 1.31885 -1.761 0.078251 .
## sexM 0.87340 0.32425 2.694 0.007068 **
## health 0.19373 0.11162 1.736 0.082649 .
## addressU -0.83448 0.40234 -2.074 0.038073 *
## famsizeLE3 0.49752 0.31629 1.573 0.115724
## Mjobhealth -0.79381 0.65218 -1.217 0.223542
## Mjobother -0.71259 0.45084 -1.581 0.113970
## Mjobservices -0.57330 0.49752 -1.152 0.249191
## Mjobteacher -0.25368 0.54375 -0.467 0.640826
## guardianmother -0.82837 0.34921 -2.372 0.017686 *
## guardianother 0.05555 0.75518 0.074 0.941360
## schoolMS -0.37018 0.51814 -0.714 0.474950
## reasonhome 0.41920 0.38397 1.092 0.274942
## reasonother 1.31310 0.51752 2.537 0.011172 *
## reasonreputation -0.04511 0.40044 -0.113 0.910317
## famsupyes -0.14499 0.31820 -0.456 0.648633
## paidyes 0.63651 0.31213 2.039 0.041427 *
## activitiesyes -0.54865 0.30311 -1.810 0.070282 .
## nurseryyes -0.44357 0.35358 -1.255 0.209647
## romanticyes -0.49066 0.31597 -1.553 0.120454
## famrel -0.52478 0.16579 -3.165 0.001549 **
## studytime -0.27793 0.20027 -1.388 0.165190
## traveltime 0.24304 0.22637 1.074 0.282979
## freetime 0.21735 0.16147 1.346 0.178275
## goout 0.89168 0.15095 5.907 3.48e-09 ***
## failures 0.29404 0.26526 1.109 0.267637
## absences 0.09631 0.02527 3.811 0.000139 ***
## G3 0.02666 0.04981 0.535 0.592451
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 319.63 on 342 degrees of freedom
## AIC: 375.63
##
## Number of Fisher Scoring iterations: 5
m28 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + school + reason + paid + activities + nursery + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m28)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob +
## guardian + school + reason + paid + activities + nursery +
## romantic + famrel + studytime + traveltime + freetime + goout +
## failures + absences + G3, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8383 -0.6862 -0.3765 0.5306 2.8548
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.38141 1.31082 -1.817 0.069258 .
## sexM 0.89197 0.32200 2.770 0.005605 **
## health 0.18934 0.11099 1.706 0.088019 .
## addressU -0.82360 0.40107 -2.054 0.040024 *
## famsizeLE3 0.50629 0.31542 1.605 0.108462
## Mjobhealth -0.83338 0.64562 -1.291 0.196763
## Mjobother -0.70548 0.45022 -1.567 0.117122
## Mjobservices -0.58517 0.49658 -1.178 0.238640
## Mjobteacher -0.27326 0.54111 -0.505 0.613565
## guardianmother -0.82408 0.34927 -2.359 0.018302 *
## guardianother 0.03067 0.75544 0.041 0.967611
## schoolMS -0.33536 0.51293 -0.654 0.513234
## reasonhome 0.41271 0.38353 1.076 0.281888
## reasonother 1.32527 0.51650 2.566 0.010292 *
## reasonreputation -0.04696 0.39985 -0.117 0.906507
## paidyes 0.60780 0.30520 1.991 0.046429 *
## activitiesyes -0.54458 0.30288 -1.798 0.072176 .
## nurseryyes -0.44167 0.35364 -1.249 0.211699
## romanticyes -0.49720 0.31594 -1.574 0.115551
## famrel -0.52100 0.16548 -3.148 0.001642 **
## studytime -0.28997 0.19876 -1.459 0.144587
## traveltime 0.24073 0.22604 1.065 0.286887
## freetime 0.21119 0.16069 1.314 0.188762
## goout 0.89196 0.15073 5.918 3.27e-09 ***
## failures 0.29978 0.26516 1.131 0.258244
## absences 0.09569 0.02512 3.810 0.000139 ***
## G3 0.02776 0.04978 0.558 0.577025
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 319.84 on 343 degrees of freedom
## AIC: 373.84
##
## Number of Fisher Scoring iterations: 5
m29 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + school + reason + paid + activities + nursery + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences, data = alc, family = "binomial")
summary(m29)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob +
## guardian + school + reason + paid + activities + nursery +
## romantic + famrel + studytime + traveltime + freetime + goout +
## failures + absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8834 -0.6840 -0.3776 0.5414 2.8104
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.05408 1.16884 -1.757 0.078855 .
## sexM 0.88825 0.32061 2.771 0.005597 **
## health 0.18175 0.11005 1.652 0.098633 .
## addressU -0.80345 0.39831 -2.017 0.043681 *
## famsizeLE3 0.52116 0.31421 1.659 0.097196 .
## Mjobhealth -0.78493 0.63846 -1.229 0.218919
## Mjobother -0.69892 0.44936 -1.555 0.119860
## Mjobservices -0.54356 0.48975 -1.110 0.267057
## Mjobteacher -0.25304 0.53837 -0.470 0.638348
## guardianmother -0.82071 0.34918 -2.350 0.018755 *
## guardianother 0.01112 0.74944 0.015 0.988160
## schoolMS -0.35999 0.51109 -0.704 0.481214
## reasonhome 0.41977 0.38309 1.096 0.273185
## reasonother 1.32112 0.51711 2.555 0.010624 *
## reasonreputation -0.04785 0.39908 -0.120 0.904569
## paidyes 0.61105 0.30511 2.003 0.045206 *
## activitiesyes -0.53253 0.30174 -1.765 0.077585 .
## nurseryyes -0.44481 0.35308 -1.260 0.207738
## romanticyes -0.50890 0.31513 -1.615 0.106340
## famrel -0.51743 0.16515 -3.133 0.001730 **
## studytime -0.27759 0.19682 -1.410 0.158435
## traveltime 0.23358 0.22496 1.038 0.299133
## freetime 0.21474 0.16047 1.338 0.180840
## goout 0.87700 0.14758 5.943 2.81e-09 ***
## failures 0.25908 0.25368 1.021 0.307102
## absences 0.09469 0.02499 3.790 0.000151 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 320.15 on 344 degrees of freedom
## AIC: 372.15
##
## Number of Fisher Scoring iterations: 5
m30 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + reason + paid + activities + nursery + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences, data = alc, family = "binomial")
summary(m30)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob +
## guardian + reason + paid + activities + nursery + romantic +
## famrel + studytime + traveltime + freetime + goout + failures +
## absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8839 -0.6898 -0.3775 0.5367 2.8255
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.17730 1.15541 -1.884 0.059505 .
## sexM 0.89525 0.31999 2.798 0.005146 **
## health 0.19003 0.10953 1.735 0.082761 .
## addressU -0.72544 0.38358 -1.891 0.058593 .
## famsizeLE3 0.51249 0.31324 1.636 0.101822
## Mjobhealth -0.77546 0.63678 -1.218 0.223304
## Mjobother -0.70399 0.44936 -1.567 0.117195
## Mjobservices -0.54322 0.48904 -1.111 0.266656
## Mjobteacher -0.27495 0.53673 -0.512 0.608467
## guardianmother -0.79620 0.34737 -2.292 0.021900 *
## guardianother 0.02925 0.75311 0.039 0.969021
## reasonhome 0.40747 0.38255 1.065 0.286817
## reasonother 1.26791 0.51068 2.483 0.013035 *
## reasonreputation -0.02835 0.39735 -0.071 0.943126
## paidyes 0.60009 0.30427 1.972 0.048585 *
## activitiesyes -0.51400 0.30017 -1.712 0.086831 .
## nurseryyes -0.43167 0.35401 -1.219 0.222709
## romanticyes -0.52006 0.31407 -1.656 0.097747 .
## famrel -0.51113 0.16409 -3.115 0.001840 **
## studytime -0.26663 0.19567 -1.363 0.173005
## traveltime 0.20841 0.22109 0.943 0.345871
## freetime 0.20529 0.15996 1.283 0.199357
## goout 0.87369 0.14741 5.927 3.08e-09 ***
## failures 0.27010 0.25267 1.069 0.285069
## absences 0.09587 0.02486 3.857 0.000115 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 320.66 on 345 degrees of freedom
## AIC: 370.66
##
## Number of Fisher Scoring iterations: 5
m31 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + reason + paid + activities + nursery + romantic + famrel + studytime + freetime + goout + failures + absences, data = alc, family = "binomial")
summary(m31)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob +
## guardian + reason + paid + activities + nursery + romantic +
## famrel + studytime + freetime + goout + failures + absences,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9201 -0.6693 -0.3788 0.6131 2.8134
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.73261 1.04989 -1.650 0.098888 .
## sexM 0.90570 0.32041 2.827 0.004704 **
## health 0.19158 0.10943 1.751 0.080000 .
## addressU -0.85025 0.35852 -2.372 0.017714 *
## famsizeLE3 0.54170 0.31175 1.738 0.082274 .
## Mjobhealth -0.82277 0.63458 -1.297 0.194785
## Mjobother -0.71149 0.44970 -1.582 0.113618
## Mjobservices -0.54651 0.48736 -1.121 0.262138
## Mjobteacher -0.30889 0.53811 -0.574 0.565951
## guardianmother -0.82702 0.34479 -2.399 0.016456 *
## guardianother 0.09127 0.74590 0.122 0.902610
## reasonhome 0.37498 0.38059 0.985 0.324487
## reasonother 1.21870 0.50695 2.404 0.016218 *
## reasonreputation -0.05788 0.39576 -0.146 0.883733
## paidyes 0.61034 0.30451 2.004 0.045036 *
## activitiesyes -0.51451 0.30001 -1.715 0.086355 .
## nurseryyes -0.43764 0.35353 -1.238 0.215742
## romanticyes -0.51168 0.31366 -1.631 0.102818
## famrel -0.50391 0.16334 -3.085 0.002035 **
## studytime -0.28187 0.19527 -1.444 0.148876
## freetime 0.19340 0.15912 1.215 0.224208
## goout 0.88148 0.14680 6.004 1.92e-09 ***
## failures 0.28102 0.25169 1.117 0.264191
## absences 0.09545 0.02474 3.859 0.000114 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 321.54 on 346 degrees of freedom
## AIC: 369.54
##
## Number of Fisher Scoring iterations: 5
m32 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + reason + paid + activities + nursery + romantic + famrel + studytime + freetime + goout + absences, data = alc, family = "binomial")
summary(m32)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob +
## guardian + reason + paid + activities + nursery + romantic +
## famrel + studytime + freetime + goout + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8364 -0.6916 -0.3742 0.6186 2.8205
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.62838 1.03909 -1.567 0.11709
## sexM 0.92409 0.31965 2.891 0.00384 **
## health 0.20762 0.10799 1.923 0.05453 .
## addressU -0.86632 0.35739 -2.424 0.01535 *
## famsizeLE3 0.54741 0.31159 1.757 0.07895 .
## Mjobhealth -0.91185 0.63436 -1.437 0.15060
## Mjobother -0.75304 0.44670 -1.686 0.09184 .
## Mjobservices -0.56842 0.48561 -1.171 0.24179
## Mjobteacher -0.41462 0.53144 -0.780 0.43528
## guardianmother -0.81293 0.34396 -2.363 0.01811 *
## guardianother 0.21191 0.73660 0.288 0.77359
## reasonhome 0.37154 0.37997 0.978 0.32817
## reasonother 1.21736 0.50496 2.411 0.01592 *
## reasonreputation -0.08370 0.39379 -0.213 0.83167
## paidyes 0.57695 0.30272 1.906 0.05666 .
## activitiesyes -0.52457 0.29905 -1.754 0.07941 .
## nurseryyes -0.42051 0.35095 -1.198 0.23084
## romanticyes -0.50262 0.31226 -1.610 0.10748
## famrel -0.52456 0.16165 -3.245 0.00117 **
## studytime -0.31888 0.19250 -1.657 0.09762 .
## freetime 0.19459 0.15863 1.227 0.21994
## goout 0.90675 0.14581 6.219 5.01e-10 ***
## absences 0.09679 0.02472 3.916 8.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 322.81 on 347 degrees of freedom
## AIC: 368.81
##
## Number of Fisher Scoring iterations: 5
m33 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + reason + paid + activities + romantic + famrel + studytime + freetime + goout + absences, data = alc, family = "binomial")
summary(m33)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob +
## guardian + reason + paid + activities + romantic + famrel +
## studytime + freetime + goout + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8594 -0.7167 -0.3836 0.6146 2.7734
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.81020 1.02798 -1.761 0.07825 .
## sexM 0.90662 0.31905 2.842 0.00449 **
## health 0.20825 0.10722 1.942 0.05210 .
## addressU -0.88950 0.35733 -2.489 0.01280 *
## famsizeLE3 0.49667 0.30835 1.611 0.10723
## Mjobhealth -0.99444 0.63260 -1.572 0.11595
## Mjobother -0.73015 0.44639 -1.636 0.10191
## Mjobservices -0.56579 0.48261 -1.172 0.24105
## Mjobteacher -0.45608 0.52878 -0.863 0.38840
## guardianmother -0.83220 0.34354 -2.422 0.01542 *
## guardianother 0.30055 0.73749 0.408 0.68362
## reasonhome 0.36564 0.37990 0.962 0.33581
## reasonother 1.20614 0.50090 2.408 0.01604 *
## reasonreputation -0.08088 0.39255 -0.206 0.83676
## paidyes 0.56829 0.30209 1.881 0.05995 .
## activitiesyes -0.50589 0.29824 -1.696 0.08984 .
## romanticyes -0.49832 0.31073 -1.604 0.10879
## famrel -0.52072 0.16068 -3.241 0.00119 **
## studytime -0.34536 0.19110 -1.807 0.07073 .
## freetime 0.18664 0.15763 1.184 0.23639
## goout 0.89963 0.14511 6.200 5.65e-10 ***
## absences 0.09505 0.02457 3.868 0.00011 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 324.23 on 348 degrees of freedom
## AIC: 368.23
##
## Number of Fisher Scoring iterations: 5
m34 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + reason + paid + activities + romantic + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m34)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob +
## guardian + reason + paid + activities + romantic + famrel +
## studytime + goout + absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8798 -0.7229 -0.3861 0.6079 2.7816
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.42394 0.96719 -1.472 0.14096
## sexM 0.94987 0.31747 2.992 0.00277 **
## health 0.20524 0.10683 1.921 0.05472 .
## addressU -0.86236 0.35600 -2.422 0.01542 *
## famsizeLE3 0.48070 0.30752 1.563 0.11802
## Mjobhealth -0.98267 0.62987 -1.560 0.11873
## Mjobother -0.69956 0.44378 -1.576 0.11494
## Mjobservices -0.52858 0.47796 -1.106 0.26877
## Mjobteacher -0.41155 0.52733 -0.780 0.43513
## guardianmother -0.84478 0.34325 -2.461 0.01385 *
## guardianother 0.37225 0.73463 0.507 0.61235
## reasonhome 0.30688 0.37404 0.820 0.41196
## reasonother 1.17332 0.49988 2.347 0.01891 *
## reasonreputation -0.11529 0.39057 -0.295 0.76786
## paidyes 0.56116 0.30094 1.865 0.06222 .
## activitiesyes -0.47737 0.29595 -1.613 0.10674
## romanticyes -0.46686 0.30805 -1.516 0.12964
## famrel -0.50323 0.15945 -3.156 0.00160 **
## studytime -0.35054 0.19051 -1.840 0.06576 .
## goout 0.94273 0.14165 6.655 2.83e-11 ***
## absences 0.09338 0.02452 3.807 0.00014 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 325.64 on 349 degrees of freedom
## AIC: 367.64
##
## Number of Fisher Scoring iterations: 5
m35 <- glm(high_use ~ sex + health + address + famsize + guardian + reason + paid + activities + romantic + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m35)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + guardian +
## reason + paid + activities + romantic + famrel + studytime +
## goout + absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8674 -0.7171 -0.4029 0.5821 2.7665
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.72681 0.94633 -1.825 0.068041 .
## sexM 0.91034 0.30568 2.978 0.002901 **
## health 0.17681 0.10399 1.700 0.089096 .
## addressU -0.94771 0.34320 -2.761 0.005756 **
## famsizeLE3 0.45183 0.30426 1.485 0.137536
## guardianmother -0.75088 0.33290 -2.256 0.024096 *
## guardianother 0.40855 0.73349 0.557 0.577533
## reasonhome 0.20790 0.36585 0.568 0.569858
## reasonother 1.02210 0.48846 2.093 0.036393 *
## reasonreputation -0.27714 0.37752 -0.734 0.462877
## paidyes 0.54838 0.29252 1.875 0.060834 .
## activitiesyes -0.46361 0.28927 -1.603 0.109002
## romanticyes -0.45457 0.30736 -1.479 0.139160
## famrel -0.48800 0.15813 -3.086 0.002028 **
## studytime -0.32528 0.18830 -1.727 0.084081 .
## goout 0.90858 0.13835 6.567 5.13e-11 ***
## absences 0.09174 0.02432 3.773 0.000161 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 329.01 on 353 degrees of freedom
## AIC: 363.01
##
## Number of Fisher Scoring iterations: 5
m36 <- glm(high_use ~ sex + health + address + guardian + reason + paid + activities + romantic + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m36)
##
## Call:
## glm(formula = high_use ~ sex + health + address + guardian +
## reason + paid + activities + romantic + famrel + studytime +
## goout + absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9142 -0.7135 -0.4238 0.6314 2.7314
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.57319 0.93900 -1.675 0.093858 .
## sexM 0.94921 0.30255 3.137 0.001705 **
## health 0.16840 0.10362 1.625 0.104131
## addressU -0.88181 0.33899 -2.601 0.009288 **
## guardianmother -0.69988 0.32862 -2.130 0.033194 *
## guardianother 0.40360 0.72237 0.559 0.576358
## reasonhome 0.17061 0.36325 0.470 0.638586
## reasonother 0.97289 0.48759 1.995 0.046011 *
## reasonreputation -0.28239 0.37686 -0.749 0.453655
## paidyes 0.54303 0.29151 1.863 0.062486 .
## activitiesyes -0.47048 0.28867 -1.630 0.103146
## romanticyes -0.41451 0.30400 -1.363 0.172726
## famrel -0.49000 0.15648 -3.131 0.001740 **
## studytime -0.33980 0.18695 -1.818 0.069132 .
## goout 0.89470 0.13681 6.540 6.17e-11 ***
## absences 0.09182 0.02445 3.755 0.000174 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 331.20 on 354 degrees of freedom
## AIC: 363.2
##
## Number of Fisher Scoring iterations: 5
m37 <- glm(high_use ~ sex + health + address + guardian + reason + paid + activities + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m37)
##
## Call:
## glm(formula = high_use ~ sex + health + address + guardian +
## reason + paid + activities + famrel + studytime + goout +
## absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8365 -0.7167 -0.4191 0.6244 2.7675
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.64430 0.93738 -1.754 0.079404 .
## sexM 0.95963 0.30116 3.186 0.001440 **
## health 0.15499 0.10292 1.506 0.132074
## addressU -0.87733 0.33754 -2.599 0.009346 **
## guardianmother -0.69452 0.32728 -2.122 0.033826 *
## guardianother 0.34130 0.71444 0.478 0.632848
## reasonhome 0.17125 0.36156 0.474 0.635751
## reasonother 0.88750 0.48109 1.845 0.065069 .
## reasonreputation -0.26597 0.37622 -0.707 0.479601
## paidyes 0.55647 0.29067 1.914 0.055563 .
## activitiesyes -0.49094 0.28719 -1.709 0.087364 .
## famrel -0.47145 0.15487 -3.044 0.002334 **
## studytime -0.35932 0.18770 -1.914 0.055578 .
## goout 0.88596 0.13606 6.511 7.45e-11 ***
## absences 0.08737 0.02401 3.639 0.000274 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 333.10 on 355 degrees of freedom
## AIC: 363.1
##
## Number of Fisher Scoring iterations: 5
m38 <- glm(high_use ~ sex + address + guardian + reason + paid + activities + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m38)
##
## Call:
## glm(formula = high_use ~ sex + address + guardian + reason +
## paid + activities + famrel + studytime + goout + absences,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9219 -0.7435 -0.4192 0.6018 2.8389
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.12869 0.86885 -1.299 0.193927
## sexM 1.00904 0.29874 3.378 0.000731 ***
## addressU -0.89571 0.33781 -2.652 0.008013 **
## guardianmother -0.70751 0.32584 -2.171 0.029904 *
## guardianother 0.29125 0.71045 0.410 0.681839
## reasonhome 0.14091 0.35838 0.393 0.694178
## reasonother 0.86090 0.47814 1.800 0.071783 .
## reasonreputation -0.33539 0.37241 -0.901 0.367793
## paidyes 0.52319 0.28763 1.819 0.068918 .
## activitiesyes -0.50255 0.28643 -1.755 0.079335 .
## famrel -0.43812 0.15217 -2.879 0.003987 **
## studytime -0.35475 0.18676 -1.899 0.057500 .
## goout 0.87536 0.13595 6.439 1.21e-10 ***
## absences 0.08705 0.02377 3.662 0.000251 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 335.41 on 356 degrees of freedom
## AIC: 363.41
##
## Number of Fisher Scoring iterations: 5
m39 <- glm(high_use ~ sex + address + guardian + paid + activities + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m39)
##
## Call:
## glm(formula = high_use ~ sex + address + guardian + paid + activities +
## famrel + studytime + goout + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8631 -0.7338 -0.4343 0.6525 2.7038
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.98780 0.84464 -1.169 0.242204
## sexM 1.01751 0.29384 3.463 0.000535 ***
## addressU -0.88774 0.32823 -2.705 0.006838 **
## guardianmother -0.68091 0.32127 -2.119 0.034056 *
## guardianother 0.27875 0.68783 0.405 0.685287
## paidyes 0.58954 0.27992 2.106 0.035195 *
## activitiesyes -0.52740 0.27932 -1.888 0.059003 .
## famrel -0.41406 0.14977 -2.765 0.005698 **
## studytime -0.42783 0.18304 -2.337 0.019422 *
## goout 0.84987 0.13236 6.421 1.35e-10 ***
## absences 0.08434 0.02276 3.705 0.000211 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 341.12 on 359 degrees of freedom
## AIC: 363.12
##
## Number of Fisher Scoring iterations: 5
m40 <- glm(high_use ~ sex + address + paid + activities + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m40)
##
## Call:
## glm(formula = high_use ~ sex + address + paid + activities +
## famrel + studytime + goout + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8947 -0.7350 -0.4209 0.6334 2.8292
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.50769 0.81175 -1.857 0.063263 .
## sexM 1.03295 0.29075 3.553 0.000381 ***
## addressU -0.80243 0.32151 -2.496 0.012566 *
## paidyes 0.55253 0.27681 1.996 0.045928 *
## activitiesyes -0.50399 0.27583 -1.827 0.067672 .
## famrel -0.39748 0.14678 -2.708 0.006769 **
## studytime -0.41410 0.18179 -2.278 0.022734 *
## goout 0.82299 0.12977 6.342 2.27e-10 ***
## absences 0.07971 0.02254 3.536 0.000406 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 347.06 on 361 degrees of freedom
## AIC: 365.06
##
## Number of Fisher Scoring iterations: 5
m41 <- glm(high_use ~ sex + address + paid + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m41)
##
## Call:
## glm(formula = high_use ~ sex + address + paid + famrel + studytime +
## goout + absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9387 -0.7500 -0.4471 0.7226 2.7051
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.55596 0.80528 -1.932 0.053334 .
## sexM 0.94192 0.28379 3.319 0.000903 ***
## addressU -0.74838 0.31779 -2.355 0.018526 *
## paidyes 0.54137 0.27503 1.968 0.049027 *
## famrel -0.39952 0.14509 -2.754 0.005896 **
## studytime -0.46626 0.18048 -2.583 0.009782 **
## goout 0.80085 0.12757 6.278 3.43e-10 ***
## absences 0.07814 0.02253 3.468 0.000524 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 350.45 on 362 degrees of freedom
## AIC: 366.45
##
## Number of Fisher Scoring iterations: 5
m42 <- glm(high_use ~ sex + address + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m42)
##
## Call:
## glm(formula = high_use ~ sex + address + famrel + studytime +
## goout + absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9143 -0.7377 -0.4386 0.7459 2.5829
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.32310 0.79014 -1.675 0.094030 .
## sexM 0.88170 0.27864 3.164 0.001555 **
## addressU -0.68822 0.31409 -2.191 0.028439 *
## famrel -0.40542 0.14396 -2.816 0.004860 **
## studytime -0.42030 0.17536 -2.397 0.016538 *
## goout 0.78974 0.12648 6.244 4.27e-10 ***
## absences 0.07432 0.02230 3.332 0.000861 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 354.40 on 363 degrees of freedom
## AIC: 368.4
##
## Number of Fisher Scoring iterations: 4
m43 <- glm(high_use ~ sex + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m43)
##
## Call:
## glm(formula = high_use ~ sex + famrel + studytime + goout + absences,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7673 -0.7725 -0.4525 0.7122 2.6684
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.80705 0.75693 -2.387 0.016970 *
## sexM 0.90778 0.27500 3.301 0.000963 ***
## famrel -0.40238 0.14164 -2.841 0.004499 **
## studytime -0.41980 0.17376 -2.416 0.015693 *
## goout 0.76397 0.12465 6.129 8.85e-10 ***
## absences 0.07613 0.02242 3.396 0.000683 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 359.16 on 364 degrees of freedom
## AIC: 371.16
##
## Number of Fisher Scoring iterations: 4
m44 <- glm(high_use ~ sex + famrel + goout + absences, data = alc, family = "binomial")
summary(m44)
##
## Call:
## glm(formula = high_use ~ sex + famrel + goout + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7082 -0.7909 -0.5005 0.7441 2.5673
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.71101 0.66639 -4.068 4.74e-05 ***
## sexM 1.07940 0.26449 4.081 4.48e-05 ***
## famrel -0.41726 0.14164 -2.946 0.003219 **
## goout 0.76974 0.12445 6.185 6.20e-10 ***
## absences 0.08218 0.02227 3.691 0.000224 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 365.31 on 365 degrees of freedom
## AIC: 375.31
##
## Number of Fisher Scoring iterations: 4
m45 <- glm(high_use ~ sex + goout + absences, data = alc, family = "binomial")
summary(m45)
##
## Call:
## glm(formula = high_use ~ sex + goout + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8060 -0.8090 -0.5248 0.8214 2.4806
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.18142 0.48085 -8.696 < 2e-16 ***
## sexM 1.02223 0.25946 3.940 8.15e-05 ***
## goout 0.72793 0.12057 6.038 1.56e-09 ***
## absences 0.08478 0.02266 3.741 0.000183 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 374.10 on 366 degrees of freedom
## AIC: 382.1
##
## Number of Fisher Scoring iterations: 4
m46 <- glm(high_use ~ sex + goout, data = alc, family = "binomial")
summary(m46)
##
## Call:
## glm(formula = high_use ~ sex + goout, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5740 -0.8658 -0.6216 0.8272 2.4929
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.8193 0.4572 -8.354 < 2e-16 ***
## sexM 0.9268 0.2507 3.696 0.000219 ***
## goout 0.7578 0.1190 6.369 1.9e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 388.94 on 367 degrees of freedom
## AIC: 394.94
##
## Number of Fisher Scoring iterations: 4
m47 <- glm(high_use ~ goout, data = alc, family = "binomial")
summary(m47)
##
## Call:
## glm(formula = high_use ~ goout, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3712 -0.7687 -0.5470 0.9952 2.3037
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.3368 0.4188 -7.968 1.62e-15 ***
## goout 0.7563 0.1165 6.494 8.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 403.05 on 368 degrees of freedom
## AIC: 407.05
##
## Number of Fisher Scoring iterations: 4
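The backward-elimination sequence above was built by retyping the whole formula for every model. A more compact alternative (a sketch, not the approach used above) is to drop one term at a time with update(), or to let stepAIC() from the MASS package run the elimination automatically (note that stepAIC selects by AIC, not by p-values as done here):

```r
# drop one term from an existing fit with update();
# m33 minus freetime gives the same model as m34 above
m34_alt <- update(m33, . ~ . - freetime)

# or run the whole backward elimination by AIC in one call:
# step_model <- MASS::stepAIC(m33, direction = "backward")
```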
# predict() the probabilities of high_use
prob21 <- predict(m21, type = "response")
prob22 <- predict(m22, type = "response")
prob23 <- predict(m23, type = "response")
prob24 <- predict(m24, type = "response")
prob25 <- predict(m25, type = "response")
prob26 <- predict(m26, type = "response")
prob27 <- predict(m27, type = "response")
prob28 <- predict(m28, type = "response")
prob29 <- predict(m29, type = "response")
prob30 <- predict(m30, type = "response")
prob31 <- predict(m31, type = "response")
prob32 <- predict(m32, type = "response")
prob33 <- predict(m33, type = "response")
prob34 <- predict(m34, type = "response")
prob35 <- predict(m35, type = "response")
prob36 <- predict(m36, type = "response")
prob37 <- predict(m37, type = "response")
prob38 <- predict(m38, type = "response")
prob39 <- predict(m39, type = "response")
prob40 <- predict(m40, type = "response")
prob41 <- predict(m41, type = "response")
prob42 <- predict(m42, type = "response")
prob43 <- predict(m43, type = "response")
prob44 <- predict(m44, type = "response")
prob45 <- predict(m45, type = "response")
prob46 <- predict(m46, type = "response")
prob47 <- predict(m47, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, prob21 = prob21)
alc <- mutate(alc, prob22 = prob22)
alc <- mutate(alc, prob23 = prob23)
alc <- mutate(alc, prob24 = prob24)
alc <- mutate(alc, prob25 = prob25)
alc <- mutate(alc, prob26 = prob26)
alc <- mutate(alc, prob27 = prob27)
alc <- mutate(alc, prob28 = prob28)
alc <- mutate(alc, prob29 = prob29)
alc <- mutate(alc, prob30 = prob30)
alc <- mutate(alc, prob31 = prob31)
alc <- mutate(alc, prob32 = prob32)
alc <- mutate(alc, prob33 = prob33)
alc <- mutate(alc, prob34 = prob34)
alc <- mutate(alc, prob35 = prob35)
alc <- mutate(alc, prob36 = prob36)
alc <- mutate(alc, prob37 = prob37)
alc <- mutate(alc, prob38 = prob38)
alc <- mutate(alc, prob39 = prob39)
alc <- mutate(alc, prob40 = prob40)
alc <- mutate(alc, prob41 = prob41)
alc <- mutate(alc, prob42 = prob42)
alc <- mutate(alc, prob43 = prob43)
alc <- mutate(alc, prob44 = prob44)
alc <- mutate(alc, prob45 = prob45)
alc <- mutate(alc, prob46 = prob46)
alc <- mutate(alc, prob47 = prob47)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, pred21 = prob21 > 0.5)
alc <- mutate(alc, pred22 = prob22 > 0.5)
alc <- mutate(alc, pred23 = prob23 > 0.5)
alc <- mutate(alc, pred24 = prob24 > 0.5)
alc <- mutate(alc, pred25 = prob25 > 0.5)
alc <- mutate(alc, pred26 = prob26 > 0.5)
alc <- mutate(alc, pred27 = prob27 > 0.5)
alc <- mutate(alc, pred28 = prob28 > 0.5)
alc <- mutate(alc, pred29 = prob29 > 0.5)
alc <- mutate(alc, pred30 = prob30 > 0.5)
alc <- mutate(alc, pred31 = prob31 > 0.5)
alc <- mutate(alc, pred32 = prob32 > 0.5)
alc <- mutate(alc, pred33 = prob33 > 0.5)
alc <- mutate(alc, pred34 = prob34 > 0.5)
alc <- mutate(alc, pred35 = prob35 > 0.5)
alc <- mutate(alc, pred36 = prob36 > 0.5)
alc <- mutate(alc, pred37 = prob37 > 0.5)
alc <- mutate(alc, pred38 = prob38 > 0.5)
alc <- mutate(alc, pred39 = prob39 > 0.5)
alc <- mutate(alc, pred40 = prob40 > 0.5)
alc <- mutate(alc, pred41 = prob41 > 0.5)
alc <- mutate(alc, pred42 = prob42 > 0.5)
alc <- mutate(alc, pred43 = prob43 > 0.5)
alc <- mutate(alc, pred44 = prob44 > 0.5)
alc <- mutate(alc, pred45 = prob45 > 0.5)
alc <- mutate(alc, pred46 = prob46 > 0.5)
alc <- mutate(alc, pred47 = prob47 > 0.5)
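The 27 predict() calls and 54 mutate() calls above can be collapsed into a single loop. A sketch, assuming the fitted models m21 ... m47 exist in the workspace as above:

```r
# collect the models into a named list and add the prob/pred
# columns for each of them in one pass
models <- mget(paste0("m", 21:47))
for (nm in names(models)) {
  p <- predict(models[[nm]], type = "response")
  alc[[sub("m", "prob", nm)]] <- p         # e.g. prob21
  alc[[sub("m", "pred", nm)]] <- p > 0.5   # e.g. pred21
}
```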
# print column names
colnames(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "guardian" "traveltime" "studytime"
## [16] "schoolsup" "famsup" "activities" "nursery" "higher"
## [21] "internet" "romantic" "famrel" "freetime" "goout"
## [26] "Dalc" "Walc" "health" "failures" "paid"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use" "probability" "prediction" "prob21" "prob22"
## [41] "prob23" "prob24" "prob25" "prob26" "prob27"
## [46] "prob28" "prob29" "prob30" "prob31" "prob32"
## [51] "prob33" "prob34" "prob35" "prob36" "prob37"
## [56] "prob38" "prob39" "prob40" "prob41" "prob42"
## [61] "prob43" "prob44" "prob45" "prob46" "prob47"
## [66] "pred21" "pred22" "pred23" "pred24" "pred25"
## [71] "pred26" "pred27" "pred28" "pred29" "pred30"
## [76] "pred31" "pred32" "pred33" "pred34" "pred35"
## [81] "pred36" "pred37" "pred38" "pred39" "pred40"
## [86] "pred41" "pred42" "pred43" "pred44" "pred45"
## [91] "pred46" "pred47"
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
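As a quick sanity check of the loss function, here is a toy example: two of four "probabilities" land on the wrong side of 0.5, so half of the predictions are wrong:

```r
loss_func(class = c(TRUE, TRUE, FALSE, FALSE), prob = c(0.9, 0.2, 0.1, 0.8))
## [1] 0.5
```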
# compute the proportion of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$prob21) %>% round(digits = 3)
## [1] 0.189
loss_func(class = alc$high_use, prob = alc$prob22) %>% round(digits = 3)
## [1] 0.186
loss_func(class = alc$high_use, prob = alc$prob23) %>% round(digits = 3)
## [1] 0.184
loss_func(class = alc$high_use, prob = alc$prob24) %>% round(digits = 3)
## [1] 0.181
loss_func(class = alc$high_use, prob = alc$prob25) %>% round(digits = 3)
## [1] 0.178
loss_func(class = alc$high_use, prob = alc$prob26) %>% round(digits = 3)
## [1] 0.186
loss_func(class = alc$high_use, prob = alc$prob27) %>% round(digits = 3)
## [1] 0.176
loss_func(class = alc$high_use, prob = alc$prob28) %>% round(digits = 3)
## [1] 0.181
loss_func(class = alc$high_use, prob = alc$prob29) %>% round(digits = 3)
## [1] 0.178
loss_func(class = alc$high_use, prob = alc$prob30) %>% round(digits = 3)
## [1] 0.184
loss_func(class = alc$high_use, prob = alc$prob31) %>% round(digits = 3)
## [1] 0.195
loss_func(class = alc$high_use, prob = alc$prob32) %>% round(digits = 3)
## [1] 0.2
loss_func(class = alc$high_use, prob = alc$prob33) %>% round(digits = 3)
## [1] 0.203
loss_func(class = alc$high_use, prob = alc$prob34) %>% round(digits = 3)
## [1] 0.186
loss_func(class = alc$high_use, prob = alc$prob35) %>% round(digits = 3)
## [1] 0.2
loss_func(class = alc$high_use, prob = alc$prob36) %>% round(digits = 3)
## [1] 0.181
loss_func(class = alc$high_use, prob = alc$prob37) %>% round(digits = 3)
## [1] 0.2
loss_func(class = alc$high_use, prob = alc$prob38) %>% round(digits = 3)
## [1] 0.2
loss_func(class = alc$high_use, prob = alc$prob39) %>% round(digits = 3)
## [1] 0.2
loss_func(class = alc$high_use, prob = alc$prob40) %>% round(digits = 3)
## [1] 0.203
loss_func(class = alc$high_use, prob = alc$prob41) %>% round(digits = 3)
## [1] 0.197
loss_func(class = alc$high_use, prob = alc$prob42) %>% round(digits = 3)
## [1] 0.211
loss_func(class = alc$high_use, prob = alc$prob43) %>% round(digits = 3)
## [1] 0.216
loss_func(class = alc$high_use, prob = alc$prob44) %>% round(digits = 3)
## [1] 0.2
loss_func(class = alc$high_use, prob = alc$prob45) %>% round(digits = 3)
## [1] 0.211
loss_func(class = alc$high_use, prob = alc$prob46) %>% round(digits = 3)
## [1] 0.214
loss_func(class = alc$high_use, prob = alc$prob47) %>% round(digits = 3)
## [1] 0.27
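The 27 repeated calls above condense to one sapply() over the probability columns added earlier (a sketch):

```r
train_err <- sapply(paste0("prob", 21:47),
                    function(p) loss_func(class = alc$high_use, prob = alc[[p]]))
round(train_err, 3)
```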
# K-fold cross-validation and the proportion of wrong predictions in the
# cross-validation (here K = nrow(alc) / 10 = 37 folds)
library(boot)
cv21 <- cv.glm(data = alc, cost = loss_func, glmfit = m21, K = nrow(alc) / 10)
cv22 <- cv.glm(data = alc, cost = loss_func, glmfit = m22, K = nrow(alc) / 10)
cv23 <- cv.glm(data = alc, cost = loss_func, glmfit = m23, K = nrow(alc) / 10)
cv24 <- cv.glm(data = alc, cost = loss_func, glmfit = m24, K = nrow(alc) / 10)
cv25 <- cv.glm(data = alc, cost = loss_func, glmfit = m25, K = nrow(alc) / 10)
cv26 <- cv.glm(data = alc, cost = loss_func, glmfit = m26, K = nrow(alc) / 10)
cv27 <- cv.glm(data = alc, cost = loss_func, glmfit = m27, K = nrow(alc) / 10)
cv28 <- cv.glm(data = alc, cost = loss_func, glmfit = m28, K = nrow(alc) / 10)
cv29 <- cv.glm(data = alc, cost = loss_func, glmfit = m29, K = nrow(alc) / 10)
cv30 <- cv.glm(data = alc, cost = loss_func, glmfit = m30, K = nrow(alc) / 10)
cv31 <- cv.glm(data = alc, cost = loss_func, glmfit = m31, K = nrow(alc) / 10)
cv32 <- cv.glm(data = alc, cost = loss_func, glmfit = m32, K = nrow(alc) / 10)
cv33 <- cv.glm(data = alc, cost = loss_func, glmfit = m33, K = nrow(alc) / 10)
cv34 <- cv.glm(data = alc, cost = loss_func, glmfit = m34, K = nrow(alc) / 10)
cv35 <- cv.glm(data = alc, cost = loss_func, glmfit = m35, K = nrow(alc) / 10)
cv36 <- cv.glm(data = alc, cost = loss_func, glmfit = m36, K = nrow(alc) / 10)
cv37 <- cv.glm(data = alc, cost = loss_func, glmfit = m37, K = nrow(alc) / 10)
cv38 <- cv.glm(data = alc, cost = loss_func, glmfit = m38, K = nrow(alc) / 10)
cv39 <- cv.glm(data = alc, cost = loss_func, glmfit = m39, K = nrow(alc) / 10)
cv40 <- cv.glm(data = alc, cost = loss_func, glmfit = m40, K = nrow(alc) / 10)
cv41 <- cv.glm(data = alc, cost = loss_func, glmfit = m41, K = nrow(alc) / 10)
cv42 <- cv.glm(data = alc, cost = loss_func, glmfit = m42, K = nrow(alc) / 10)
cv43 <- cv.glm(data = alc, cost = loss_func, glmfit = m43, K = nrow(alc) / 10)
cv44 <- cv.glm(data = alc, cost = loss_func, glmfit = m44, K = nrow(alc) / 10)
cv45 <- cv.glm(data = alc, cost = loss_func, glmfit = m45, K = nrow(alc) / 10)
cv46 <- cv.glm(data = alc, cost = loss_func, glmfit = m46, K = nrow(alc) / 10)
cv47 <- cv.glm(data = alc, cost = loss_func, glmfit = m47, K = nrow(alc) / 10)
# proportion of wrong predictions in the cross-validation
cv21$delta[1] %>% round(digits = 3)
## [1] 0.257
cv22$delta[1] %>% round(digits = 3)
## [1] 0.249
cv23$delta[1] %>% round(digits = 3)
## [1] 0.251
cv24$delta[1] %>% round(digits = 3)
## [1] 0.257
cv25$delta[1] %>% round(digits = 3)
## [1] 0.235
cv26$delta[1] %>% round(digits = 3)
## [1] 0.243
cv27$delta[1] %>% round(digits = 3)
## [1] 0.227
cv28$delta[1] %>% round(digits = 3)
## [1] 0.227
cv29$delta[1] %>% round(digits = 3)
## [1] 0.235
cv30$delta[1] %>% round(digits = 3)
## [1] 0.23
cv31$delta[1] %>% round(digits = 3)
## [1] 0.246
cv32$delta[1] %>% round(digits = 3)
## [1] 0.232
cv33$delta[1] %>% round(digits = 3)
## [1] 0.227
cv34$delta[1] %>% round(digits = 3)
## [1] 0.211
cv35$delta[1] %>% round(digits = 3)
## [1] 0.222
cv36$delta[1] %>% round(digits = 3)
## [1] 0.224
cv37$delta[1] %>% round(digits = 3)
## [1] 0.216
cv38$delta[1] %>% round(digits = 3)
## [1] 0.227
cv39$delta[1] %>% round(digits = 3)
## [1] 0.224
cv40$delta[1] %>% round(digits = 3)
## [1] 0.214
cv41$delta[1] %>% round(digits = 3)
## [1] 0.205
cv42$delta[1] %>% round(digits = 3)
## [1] 0.227
cv43$delta[1] %>% round(digits = 3)
## [1] 0.23
cv44$delta[1] %>% round(digits = 3)
## [1] 0.205
cv45$delta[1] %>% round(digits = 3)
## [1] 0.216
cv46$delta[1] %>% round(digits = 3)
## [1] 0.241
cv47$delta[1] %>% round(digits = 3)
## [1] 0.27
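The 27 cross-validations likewise condense to one loop. A sketch, assuming models m21 ... m47 exist as above; note that cv.glm's fold split is random, so the results differ between runs unless a seed is set first:

```r
set.seed(2022)  # for reproducible fold splits
cv_err <- sapply(mget(paste0("m", 21:47)),
                 function(m) cv.glm(data = alc, cost = loss_func,
                                    glmfit = m, K = nrow(alc) / 10)$delta[1])
round(cv_err, 3)
```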
# collect the training and testing errors of the 27 models
# (in fitting order, i.e. m21 first and m47 last)
err <- data.frame(npredictor = c(1:27),
train = c(0.189, 0.186, 0.184, 0.181, 0.178, 0.186, 0.176, 0.181, 0.178, 0.184, 0.195, 0.200, 0.203, 0.186, 0.200, 0.181, 0.200, 0.200, 0.200, 0.203, 0.197, 0.211, 0.216, 0.200, 0.211, 0.214, 0.270),
test = c(0.259, 0.235, 0.249, 0.235, 0.241, 0.251, 0.232, 0.23, 0.227, 0.227, 0.232, 0.230, 0.227, 0.216, 0.219, 0.219, 0.227, 0.222, 0.222, 0.224, 0.211, 0.224, 0.230, 0.216, 0.219, 0.243, 0.270))
# reshape the dataset to long format with tidyr's pivot_longer()
err_long <- err %>% pivot_longer(cols = c("train", "test"), names_to = "model", values_to = "ploss")
# print dimensions and column names
dim(err); colnames(err)
## [1] 27 3
## [1] "npredictor" "train" "test"
dim(err_long); colnames(err_long)
## [1] 54 3
## [1] "npredictor" "model" "ploss"
# access the ggplot2 library
library(ggplot2)
# Initialize plots with data and aesthetic mapping
ep1 <- ggplot(err_long, aes(x = npredictor, y = ploss, color = model))
# Define the visualization type (points) + add a regression line
# + add a main title and draw the plot
ep2 <- ep1 + geom_point() + geom_smooth(method = "glm", formula = y ~ poly(x, 2)) + ggtitle("Predictive loss in training and testing")
ep2
# new data for prediction
errpred <- data.frame(npredictor = c(1:27))
# fit second-degree polynomial models to the errors
# (note: lm(train ~ npredictor^2) would NOT fit a quadratic, because ^
# inside a model formula means factor crossing; use poly() or I() instead)
predm1 <- glm(train ~ poly(npredictor, 2), data = err)
predm2 <- glm(test ~ poly(npredictor, 2), data = err)
# predict the fitted error values (note: predict.glm ignores the
# 'interval' argument, so only point predictions are returned)
predict(predm1, newdata = errpred, interval = 'confidence') %>% round(digits = 3)
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0.187 0.186 0.185 0.184 0.183 0.183 0.183 0.183 0.184 0.184 0.185 0.186 0.188
## 14 15 16 17 18 19 20 21 22 23 24 25 26
## 0.189 0.191 0.193 0.196 0.198 0.201 0.204 0.207 0.211 0.215 0.219 0.223 0.227
## 27
## 0.232
predict(predm2, newdata = errpred, interval = 'confidence') %>% round(digits = 3)
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0.256 0.251 0.247 0.243 0.239 0.236 0.233 0.230 0.228 0.226 0.224 0.223 0.222
## 14 15 16 17 18 19 20 21 22 23 24 25 26
## 0.221 0.221 0.221 0.221 0.222 0.223 0.224 0.225 0.227 0.230 0.232 0.235 0.238
## 27
## 0.242
predict(predm1, newdata = errpred, interval = 'confidence') %>% round(digits = 3) %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1830 0.1845 0.1890 0.1966 0.2055 0.2320
predict(predm2, newdata = errpred, interval = 'confidence') %>% round(digits = 3) %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2210 0.2230 0.2280 0.2311 0.2370 0.2560
# summary values for dataset err
summary(err)
## npredictor train test
## Min. : 1.0 Min. :0.1760 Min. :0.211
## 1st Qu.: 7.5 1st Qu.:0.1840 1st Qu.:0.222
## Median :14.0 Median :0.1970 Median :0.227
## Mean :14.0 Mean :0.1967 Mean :0.231
## 3rd Qu.:20.5 3rd Qu.:0.2015 3rd Qu.:0.235
## Max. :27.0 Max. :0.2700 Max. :0.270
To compare different logistic regression models, I started with a model containing 27 explanatory variables, i.e. nearly all variables available, and then built 27 nested models by removing one variable at a time, always deleting the variable with the highest p-value. In the model with seven explanatory variables (sex, address, paid, family relations, study time, going out and absences), all predictors are statistically significant (p < 0.05).
The training errors varied from 0.176 to 0.270 and testing errors from 0.211 to 0.270. In the plot, it can be seen that the relationships between the number of the explanatory variables and errors are not linear but more towards 2-degree polynomial. In training, a model with 7 explanatory variables has the smallest error (0.176); in testing, the minimal error (0.211) is with a model with 21 explanatory variables. On the other hand, when these errors are modeled, the smallest error comes with 5-8 explanatory variables in training (0.183) and with 14-17 explanatory variables in testing (0.221). All in all, the errors varied little, implying that the phenomena (high use of alcohol) is indeed multifaceted by nature. In addition, the categorization of some variables does not seem to be statistically optimal and some ordinal variables might be better modeled as categorical ones.
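The stepwise reduction described above can be sketched as a loop that refits the model after dropping the least significant predictor. This is a minimal illustration on the built-in mtcars data, with am as a stand-in binary outcome (not the course data):

```r
# backward elimination by largest p-value, sketched on mtcars
# ('am' and the predictors are stand-ins, not the course variables)
m <- glm(am ~ mpg + wt + hp + qsec, data = mtcars, family = binomial)
while (length(coef(m)) > 2) {
  p <- summary(m)$coefficients[-1, 4]   # p-values, intercept excluded
  if (max(p) < 0.05) break              # stop when all predictors are significant
  m <- update(m, as.formula(paste(". ~ . -", names(which.max(p)))))
}
```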
date()
## [1] "Fri Nov 18 20:02:33 2022"
# LOADING THE DATA
# access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# define subset of data for boxplots
box <- dplyr::select(Boston, -tax, -black)
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# plot distributions and matrix of the variables
boxplot(box)
boxplot(Boston$tax)
boxplot(Boston$black)
pairs(Boston)
The Boston dataset contains information from 506 towns on factors such as crime rate, residential land zoning, nitrogen oxides concentration, rooms per dwelling, property-tax rate, pupil-teacher ratio, lower-status population and value of owner-occupied homes. Altogether 14 variables are included. No values are missing.
The variable distributions are mostly skewed. The median proportion of owner-occupied units built prior to 1940 (age) is 77.5% (range 2.9% to 100%). Between towns, the per-capita crime rate varies from 0.006 to 89.0, the property-tax rate from $187 to $711 per $10,000, the pupil-teacher ratio from 12.6 to 22.0, the proportion of lower-status population from 1.7% to 38.0% and the nitrogen oxides concentration from 0.39 to 0.87 parts per 10 million.
# EXPLORING THE DATA
# access the tidyr and corrplot libraries
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", tl.pos = "d", tl.cex = 0.6)
Most of the variables are correlated with each other. The exception is the Charles River dummy variable (describing whether the tract bounds the river), which correlates only mildly with the median value of owner-occupied homes (r = 0.18). There are remarkably high correlations between age and both nitrogen oxides concentration (r = 0.73) and mean distance to the main employment centres (r = -0.75); between non-retail business acres and nitrogen oxides concentration (r = 0.76), mean distance to the main employment centres (r = -0.71) and tax rate (r = 0.72); between the average number of rooms per dwelling and the median value of owner-occupied homes (r = 0.70); between accessibility to radial highways and tax rate (r = 0.91); and between the proportion of lower-status population and the median value of owner-occupied homes (r = -0.74).
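The strongest pairs can also be picked out programmatically from the correlation matrix rather than by eye (the 0.7 cutoff below is an arbitrary illustration):

```r
library(MASS)
# keep each pair once by blanking the upper triangle and the diagonal
cm <- cor(Boston)
cm[upper.tri(cm, diag = TRUE)] <- NA
strong <- which(abs(cm) >= 0.7, arr.ind = TRUE)
data.frame(var1 = rownames(cm)[strong[, 1]],
           var2 = colnames(cm)[strong[, 2]],
           r = round(cm[strong], 2))
```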
# SCALING THE DATASET AND CREATING 'CRIME' VARIABLE
# accessing library dplyr
library(dplyr)
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# plot distributions
boxplot(boston_scaled)
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- scale(Boston) %>% as.data.frame
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)
# look at the table of the new factor crime
table(crime)
## crime
## [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
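The same quantile-binning pattern generalizes to any continuous variable. A toy version with illustrative labels (not from the course data):

```r
# cut() with quantile() breaks yields four roughly equal-sized groups
set.seed(42)
v <- rnorm(100)
f <- cut(v, breaks = quantile(v), include.lowest = TRUE,
         labels = c("low", "med_low", "med_high", "high"))
table(f)
```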
Scaling centres the variables and puts them on a common scale, which makes the boxplots comparable; note, however, that as a linear transformation it does not actually change the skewness of the distributions.
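That scaling leaves skewness untouched can be checked directly. A minimal sketch on simulated right-skewed data, using a manual skewness formula so no extra packages are needed:

```r
# z-scoring is linear, so the shape (skewness) of a distribution is unchanged
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
set.seed(1)
x <- rexp(1000)                       # strongly right-skewed
z <- as.numeric(scale(x))             # centred to mean 0, sd 1
c(before = skew(x), after = skew(z))  # identical values
```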
# LINEAR DISCRIMINANT ANALYSIS
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## 0.2500000 0.2351485 0.2574257 0.2574257
##
## Group means:
## zn indus chas nox rm
## [-0.419,-0.411] 1.02173785 -0.8994928 -0.07742312 -0.8993438 0.44371484
## (-0.411,-0.39] -0.09908927 -0.2760698 -0.02367011 -0.5532177 -0.15740049
## (-0.39,0.00739] -0.41220520 0.2158375 0.18195173 0.3697296 0.02572831
## (0.00739,9.92] -0.48724019 1.0170690 -0.12090214 1.0401149 -0.43851128
## age dis rad tax ptratio
## [-0.419,-0.411] -0.8973034 0.8906287 -0.6941859 -0.7312167 -0.47063537
## (-0.411,-0.39] -0.2969235 0.3663729 -0.5478716 -0.4510348 -0.05336559
## (-0.39,0.00739] 0.4633028 -0.3947533 -0.4131592 -0.2925436 -0.21263412
## (0.00739,9.92] 0.8140138 -0.8452125 1.6386213 1.5144083 0.78135074
## black lstat medv
## [-0.419,-0.411] 0.37814722 -0.78268502 0.52748692
## (-0.411,-0.39] 0.34380870 -0.07818535 -0.02084936
## (-0.39,0.00739] 0.09560218 0.03615524 0.10223551
## (0.00739,9.92] -0.86865503 0.89479993 -0.76864899
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.07786455 0.826267099 -1.055546312
## indus 0.03581329 -0.146171680 0.149694214
## chas -0.08822401 -0.094807466 0.001126426
## nox 0.35488694 -0.764042995 -1.240672538
## rm -0.12605299 -0.003112984 -0.120153519
## age 0.30576672 -0.413983634 -0.284743661
## dis -0.04929312 -0.416153601 0.360571915
## rad 3.26161745 0.972155354 -0.145680349
## tax -0.08199507 -0.123880188 0.752694157
## ptratio 0.10642558 -0.005923155 -0.272584724
## black -0.13015372 0.021996640 0.162636230
## lstat 0.14330588 -0.125468330 0.571132827
## medv 0.12705818 -0.378783422 -0.091925529
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9456 0.0409 0.0135
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
In the linear discriminant analysis, the highest crime class almost solely occupies one cluster, while the other crime classes are mixed in the other cluster. The most influential separators seem to be accessibility to radial highways, the proportion of residential land zoned for lots over 25,000 sq ft (about 2,323 m2) and nitrogen oxides concentration, most likely reflecting differences between rural and urban environments. The first discriminant function accounts for 94.6% of the between-group variance (proportion of trace).
# PREDICTING WITH THE LDA MODEL
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class) %>% addmargins
## predicted
## correct [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## [-0.419,-0.411] 11 13 2 0
## (-0.411,-0.39] 4 18 9 0
## (-0.39,0.00739] 1 5 15 1
## (0.00739,9.92] 0 0 0 23
## Sum 16 36 26 24
## predicted
## correct Sum
## [-0.419,-0.411] 26
## (-0.411,-0.39] 31
## (-0.39,0.00739] 22
## (0.00739,9.92] 23
## Sum 102
The LDA model works reasonably well: it correctly predicts the crime class in 67 of the 102 towns (66%).
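The share of correct predictions is the sum of the diagonal of the cross-tabulation divided by its total; with the table above rebuilt as a plain matrix for illustration:

```r
# accuracy from the confusion matrix above: correct classes on the diagonal
tab <- matrix(c(11, 13,  2,  0,
                 4, 18,  9,  0,
                 1,  5, 15,  1,
                 0,  0,  0, 23), nrow = 4, byrow = TRUE)
sum(diag(tab)) / sum(tab)   # 67 / 102, about 0.66
```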
# DISTANCES AND CLUSTERING
# access library ggplot2
library(ggplot2)
# scale dataset
boston_scaled2 <- data.frame(scale(Boston))
# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled2, method = "manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
# k-means clustering
km <- kmeans(boston_scaled2, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)
pairs(boston_scaled2[1:7], col = km$cluster)
pairs(boston_scaled2[8:14], col = km$cluster)
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <- kmeans(boston_scaled2, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)
pairs(boston_scaled2[1:7], col = km$cluster)
pairs(boston_scaled2[8:14], col = km$cluster)
Scaling reduces the distances and the skewness of their distribution. For k-means clustering, 2 centers seem to be the best choice based on the drop in the total within-cluster sum of squares. One cluster seems to be associated with a low crime rate, a low proportion of non-retail business acres, low nitrogen oxides concentration, newer housing stock, a high mean distance to the main employment centres, low accessibility to radial highways, a low tax rate, a high value of the black variable and a high median value of owner-occupied homes. This may suggest that the clustering separates rural or otherwise less densely populated areas from urban ones.
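That scaling shrinks the distances can be verified by comparing distance summaries on the raw and standardized data (raw distances are dominated by the large-range tax variable):

```r
library(MASS)
# Euclidean distances before and after standardizing the Boston data
raw <- as.numeric(dist(Boston))
std <- as.numeric(dist(scale(Boston)))
c(raw_median = median(raw), std_median = median(std))   # std_median is about 4.8
```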
# BONUS SECTION: MORE ON K-MEANS CLUSTERING AND LINEAR DISCRIMINANT ANALYSIS
# k-means clustering
km2 <- kmeans(boston_scaled2, centers = 6)
# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km2$cluster)
pairs(boston_scaled2[1:7], col = km2$cluster)
pairs(boston_scaled2[8:14], col = km2$cluster)
# linear discriminant analysis
lda.fit2 <- lda(km2$cluster ~ ., data = boston_scaled2)
# print the lda.fit object
lda.fit2
## Call:
## lda(km2$cluster ~ ., data = boston_scaled2)
##
## Prior probabilities of groups:
## 1 2 3 4 5 6
## 0.06719368 0.12252964 0.24505929 0.18972332 0.29051383 0.08498024
##
## Group means:
## crim zn indus chas nox rm age
## 1 -0.1985497 -0.2602436 0.2799956 3.6647712 0.3830784 0.2756681 0.3721322
## 2 -0.4141953 2.3322773 -1.1788641 -0.2088275 -1.1887944 0.7188485 -1.4216721
## 3 1.1156495 -0.4872402 1.0149946 -0.2723291 0.9916473 -0.4276828 0.7515952
## 4 -0.3269211 -0.4816572 0.6406213 -0.2723291 0.4664828 -0.5082323 0.7668344
## 5 -0.3977957 -0.1563279 -0.5990216 -0.2723291 -0.6696819 -0.1686625 -0.6751063
## 6 -0.3732407 -0.1422288 -0.8310017 -0.2723291 -0.2005297 1.6901170 0.1841367
## dis rad tax ptratio black lstat
## 1 -0.4033382 0.001081444 -0.0975633 -0.39245849 0.17154271 -0.1643525
## 2 1.6223203 -0.666968948 -0.5535972 -0.82951564 0.35193833 -0.9665363
## 3 -0.8170870 1.659602895 1.5294129 0.80577843 -0.81154619 0.9129958
## 4 -0.5732656 -0.602637816 -0.1649468 0.21059409 0.04824716 0.5397714
## 5 0.5672190 -0.575610755 -0.6877857 -0.05330275 0.36267528 -0.3956902
## 6 -0.3232389 -0.511800977 -0.8155263 -1.10522083 0.34963036 -0.9616241
## medv
## 1 0.573340910
## 2 0.799806470
## 3 -0.771340259
## 4 -0.505310108
## 5 0.007601824
## 6 1.719927960
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4 LD5
## crim 0.034508803 0.05797085 -0.16103219 0.152322830 0.01702637
## zn 0.263383770 -0.23309900 -1.54525759 -1.099430119 0.78886901
## indus -0.073596625 0.35511069 0.27621202 -0.741302164 -0.03049072
## chas -5.833329077 -0.28810784 -0.25366217 -0.153849858 -0.20101832
## nox 0.012019761 -0.33200265 0.08267993 -0.138256930 0.38821434
## rm 0.077523647 0.02092389 -0.05279584 0.346603115 0.54639625
## age -0.078503080 0.10642631 0.50741330 -0.163867301 0.79431423
## dis -0.055475885 -0.36793105 -0.33307268 -0.006313437 -0.37843684
## rad -0.319867249 3.41804068 -1.06679562 1.629076215 -0.94341422
## tax 0.011360375 -0.30248228 -0.48448257 -0.924010759 0.59988910
## ptratio -0.010141142 -0.03367258 0.16740977 -0.504498820 0.12595753
## black -0.007393166 -0.08886704 0.01924355 0.045161817 -0.07202500
## lstat 0.123902259 0.13351602 0.01047765 0.007966329 0.40940453
## medv 0.203273034 -0.38745314 -0.07192725 0.516887596 0.76079378
##
## Proportion of trace:
## LD1 LD2 LD3 LD4 LD5
## 0.6292 0.2589 0.0730 0.0230 0.0159
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# plot the lda results
plot(lda.fit2, dimen = 2, col = km2$cluster, pch = km2$cluster)
lda.arrows(lda.fit2, myscale = 2)
In the plot, the linear discriminant analysis shows three clusters: one is occupied by k-means cluster 1, another by k-means cluster 3, and the third contains the remaining k-means clusters. The most influential separators seem to be accessibility to radial highways and the Charles River dummy variable. The first discriminant function accounts for 62.9% and the second for 25.9% of the between-group variance.
# SUPER-BONUS SECTION: 3D PLOTTING
# access library plotly
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# define a new object
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
# k-means clustering
km3 <- kmeans(matrix_product, centers = 2)
km4 <- km3$cluster
# Create 3D plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km4)
The first 3D plot shows two clusters that are well separated from each other. In the second plot, one cluster is mainly occupied by the highest crime class. In the third plot, the colouring coincides fully with the k-means clustering with two centers.
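One caveat about the manual matrix multiplication used above: predict() centres the predictors at the grand mean of the group means before projecting, so the manual scores differ from predict(fit)$x only by a constant shift in each dimension. A sketch on the built-in iris data:

```r
library(MASS)
# manual projection vs. predict(): same shape, constant offset per dimension
fit <- lda(Species ~ ., data = iris)
manual <- as.matrix(iris[, 1:4]) %*% fit$scaling
auto <- predict(fit)$x
offsets <- manual - auto
apply(offsets, 2, function(d) diff(range(d)))   # essentially zero in each column
```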